August 13th, 2013

Run H2O From Within R

With the REST API, it’s simple to run H2O operations from within R using syntax similar to that of your favorite R functions. In this post, we’ll walk through a simple demo of its capabilities. First, get H2O installed and running by following the tutorial here. Once you have the R package loaded, you can list the included demos by typing demo(package = "h2o"), and run one of them by typing, e.g., demo(h2o.glm). We’ll be stepping through a few of the basic statistical functions in this tutorial.

Starting up H2O in R

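library(h2o)  # load the H2O R package
# Describe where H2O is listening (the local default is shown here)
localH2O = new("H2OClient", ip = "127.0.0.1", port = 54321)
h2o.checkClient(localH2O)  # confirm that R can reach the H2O instance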

The beginning of each H2O R script looks the same: first, load the R package, then create an H2OClient object containing the IP and port at which H2O resides. If you are running H2O on your local machine, the defaults are ip = "127.0.0.1" and port = 54321. You can call h2o.checkClient to check that H2O is reachable. Once that’s done, we are ready to work with some data!

Importing and Summarizing Data

In this tutorial, we will be working with the prostate cancer data set, which comes from a study by Dr. Donn Young at The Ohio State University Comprehensive Cancer Center of patients with varying degrees of prostate cancer. The relevant columns are CAPSULE (binary variable indicating tumor penetration of the prostatic capsule), AGE (in years), RACE (1 = white, 2 = black), PSA (prostate-specific antigen value), and GLEASON (total Gleason score, indicating how aggressive the cancer is). See Applied Logistic Regression by Hosmer and Lemeshow (2000) for more details.

prostate.data = h2o.importURL(localH2O, path = "https://raw.github.com/0xdata/h2o/master/smalldata/logreg/prostate.csv", key = "prostate.hex")
summary(prostate.data)

The first line imports and parses the data set prostate.csv from the given URL, storing it in H2O under a unique identifier (hex key), prostate.hex. The method h2o.importURL returns an H2OParsedData object containing the IP and port on which H2O resides, as well as the data set’s hex key, which we save to prostate.data. All references within R to this data set will now be through prostate.data. Hence, if we wanted to get summary statistics, we’d call summary(prostate.data). This displays the minimum, maximum, median, mean, and quantiles of each column of the data set, just like in R (only relevant columns are shown below):

    CAPSULE         AGE              RACE            PSA                GLEASON
 Min.   :0.000   Min.   :43.000   Min.   :0.000   Min.   :  0.300   Min.   :0.000
 1st Qu.:0.000   1st Qu.:62.000   1st Qu.:1.000   1st Qu.:  5.132   1st Qu.:6.000
 Median :0.000   Median :67.000   Median :1.000   Median :  5.132   Median :6.000
 Mean   :0.403   Mean   :66.039   Mean   :1.087   Mean   : 15.409   Mean   :6.384
 3rd Qu.:1.000   3rd Qu.:71.000   3rd Qu.:1.000   3rd Qu.: 14.795   3rd Qu.:7.000
 Max.   :1.000   Max.   :79.000   Max.   :2.000   Max.   :139.700   Max.   :9.000
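
For comparison, here is a minimal sketch of the same kind of per-column summary in plain R, run on an in-memory data frame. The toy values are made up for illustration and are not taken from the prostate data set; only base R functions are used.

```r
# Toy data frame with made-up values (NOT the prostate data).
toy <- data.frame(
  AGE = c(50, 61, 67, 72, 79),
  PSA = c(0.3, 4.8, 9.1, 14.5, 139.7)
)

# summary() on a data frame reports min, quartiles, mean and max per column.
toy_summary <- summary(toy)
print(toy_summary)

# quantile() returns the quartiles of a single column directly.
age_quartiles <- quantile(toy$AGE, probs = c(0.25, 0.5, 0.75))
print(age_quartiles)
```

The difference with H2O is where the work happens: here the data lives in R's memory, whereas summary(prostate.data) asks H2O to compute the statistics on data it holds.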

Running GLM (Generalized Linear Model)

Now that we have a sense of the data set’s structure, we can run a statistical analysis on it. Let’s try a logistic regression, with CAPSULE as the response and AGE, RACE, PSA and GLEASON as the predictors. The GLM family in this case is binomial, with the default link function logit. (See Wiki for a more detailed mathematical explanation.) In essence, we are studying how the probability of capsular involvement is affected by a patient’s age, race, PSA and Gleason score.

h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","GLEASON"), data = prostate.data, family = "binomial", nfolds = 10, alpha = 0.5)

The result should include the following coefficients:

Coefficients:

     AGE      RACE       PSA   GLEASON  Intercept
-0.02119  -0.46410   0.02804   1.07613   -5.86616
Degrees of Freedom: 379 Total (i.e. Null);  374 Residual
Null Deviance:     512.3
Residual Deviance: 416.3
AIC: 426.3

Looking at the coefficients, we see that the log-odds (and by extension, the probability) of prostate capsular penetration increase with PSA and Gleason score, as expected, but decrease slightly with age. The negative RACE coefficient means that, with the coding 1 = white and 2 = black, a black patient has lower estimated odds of capsular involvement than a white patient with the same age, PSA and Gleason score. The output above includes no standard errors, so it does not show whether this difference is statistically significant, and it is unknown whether this is a direct effect or whether race is capturing some other characteristic excluded from the regression.
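
To make the log-odds interpretation concrete, the sketch below converts the reported coefficients into odds ratios and into a predicted probability for one patient, using only base R. It assumes the coefficients are reported on the original, unstandardized scale of each column, which this post does not state; the patient values are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the h2o.glm output above (log-odds scale).
coefs <- c(AGE = -0.02119, RACE = -0.46410, PSA = 0.02804,
           GLEASON = 1.07613, Intercept = -5.86616)
predictors <- c("AGE", "RACE", "PSA", "GLEASON")

# exp() turns a log-odds coefficient into an odds ratio
# for a one-unit increase in that predictor.
odds_ratios <- exp(coefs[predictors])
print(round(odds_ratios, 3))

# Hypothetical patient, invented for illustration only.
patient <- c(AGE = 65, RACE = 1, PSA = 10, GLEASON = 7)

# Linear predictor: intercept plus the sum of coefficient * value.
log_odds <- coefs[["Intercept"]] + sum(coefs[predictors] * patient[predictors])

# plogis() is the inverse logit: 1 / (1 + exp(-x)).
p_capsule <- plogis(log_odds)
print(p_capsule)
```

An odds ratio above 1 (PSA, GLEASON) means the odds of capsular penetration rise as that column increases; a ratio below 1 (AGE, RACE) means they fall.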

Running K-Means Clustering

Now, let’s run the k-means algorithm to group similar patients into clusters. (See Wiki for a description of the mathematics.) We start with k = 5 clusters, using only the columns AGE, RACE, GLEASON, CAPSULE and PSA:

prostate.km = h2o.kmeans(data = prostate.data, centers = 5, cols = c("AGE","RACE","GLEASON","CAPSULE","PSA"))
print(prostate.km)

K-means clustering with 5 clusters of sizes 278, 4, 23, 69, 6

Cluster means:

       AGE     RACE  GLEASON   CAPSULE        PSA
1 66.14947 1.071174 6.124555 0.3060498   7.107402
2 65.75000 1.250000 8.000000 1.0000000 131.175000
3 66.09091 1.227273 7.136364 0.7272727  55.213636
4 65.44776 1.089552 7.014925 0.6119403  23.876119
5 67.50000 1.166667 7.666667 1.0000000  86.500000

From our results, we see that 278 patients, the large majority, are in cluster 1, with a mean age close to 66 years and only about 30% exhibiting capsular penetration. The mean PSA and Gleason score of this cluster are by far the lowest. In contrast, cluster 2 is the smallest, with only 4 patients, but they all show capsular penetration and, as expected, far higher Gleason scores and PSA. K-means has thus separated the patients into groups that differ mainly in PSA level and in how often capsular penetration occurs.
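
The cluster means above differ mostly in PSA, the column with the widest numeric range. As a rough illustration of why scale matters, here is a sketch using base R's kmeans on synthetic data. It does not use H2O or the prostate data; the column names and distributions are assumptions made for the example, and whether h2o.kmeans rescales columns internally is not stated in this post.

```r
set.seed(42)  # make the synthetic data and the random starts repeatable

# Synthetic stand-in data (NOT the prostate data): most rows have a low
# PSA-like value, a few have a very high one. AGE is unrelated noise.
toy <- data.frame(
  AGE = rnorm(100, mean = 66, sd = 6),
  PSA = c(rnorm(90, mean = 8, sd = 3), rnorm(10, mean = 120, sd = 10))
)

# Base R k-means; nstart = 10 tries several random starts and keeps the best.
fit <- kmeans(toy, centers = 2, nstart = 10)

print(fit$size)     # number of rows assigned to each cluster
print(fit$centers)  # per-cluster column means, like "Cluster means" above

# Standardizing first puts every column on the same scale, so no single
# wide-ranging column dominates the Euclidean distance.
fit_scaled <- kmeans(scale(toy), centers = 2, nstart = 10)
print(fit_scaled$size)
```

Comparing fit$centers with fit_scaled$centers on your own data is a quick way to see how much one wide-ranging column is driving the grouping.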

While this tutorial used a relatively small data set, H2O lets you manipulate data sets that are too large for conventional R to handle. With the H2O R package installed, you can treat them like any other data set in R, and H2O will do the heavy lifting in the background for you. Try it out yourself!
