Run H2O From Within R


With the REST API, it's simple to run H2O operations from within R, using syntax similar to that of your favorite R functions. In this post, we'll walk through a simple demo of its capabilities. First, get H2O installed and running by following the installation tutorial. Once you have the R package loaded, you can list the included demos by typing demo(package = "h2o"), and run one of them by typing, e.g., demo(h2o.glm). We'll be stepping through a few of the basic statistical functions in this tutorial.

Starting up H2O in R

library(h2o)
localH2O = new("H2OClient", ip = "127.0.0.1", port = 54321)
h2o.checkClient(localH2O)

The beginning of each H2O R script looks the same: first, load the R package, then create an H2OClient object containing the IP and port at which H2O resides. If you are running H2O on your local machine, the defaults are IP = 127.0.0.1 and port = 54321. You can call h2o.checkClient to check that H2O is reachable. Once that's done, we are ready to work with some data!

Importing and Summarizing Data

In this tutorial, we will be working with the prostate cancer data set, which comes from a study by Dr. Donn Young at The Ohio State University Comprehensive Cancer Center of patients with varying degrees of prostate cancer. The relevant columns are CAPSULE (binary variable indicating tumor penetration of the prostatic capsule), AGE (in years), RACE (1 = white, 2 = black), PSA (prostate-specific antigen value), and GLEASON (total Gleason score, indicating how aggressive the cancer is). See Applied Logistic Regression by Hosmer and Lemeshow (2000) for more details.

prostate.data = h2o.importURL(localH2O, path = "https://raw.github.com/0xdata/h2o/master/smalldata/logreg/prostate.csv", key = "prostate.hex")
summary(prostate.data)

The first line imports and parses the data set prostate.csv from the given URL, storing it in H2O under a unique identifier (hex key), prostate.hex. The method h2o.importURL returns an H2OParsedData object containing the IP and port on which H2O resides, as well as the data set's hex key, which we save to prostate.data. All references within R to this data set will now go through prostate.data. To get summary statistics, for example, we call summary(prostate.data). This displays the minimum, maximum, median, mean, and quartiles of each column of the data set, just like in R (only relevant columns are shown below):

CAPSULE	AGE	RACE	PSA	GLEASON
Min. :0.000	Min. :43.000	Min. :0.000	Min. : 0.300	Min. :0.000
1st Qu.:0.000	1st Qu.:62.000	1st Qu.:1.000	1st Qu.: 5.132	1st Qu.:6.000
Median :0.000	Median :67.000	Median :1.000	Median : 5.132	Median :6.000
Mean :0.403	Mean :66.039	Mean :1.087	Mean : 15.409	Mean :6.384
3rd Qu.:1.000	3rd Qu.:71.000	3rd Qu.:1.000	3rd Qu.: 14.795	3rd Qu.:7.000
Max. :1.000	Max. :79.000	Max. :2.000	Max. : 139.700	Max. :9.000
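
For comparison, here is a minimal sketch of the same kind of summary computed locally with base R rather than H2O. The data frame toy and its values are made up for illustration and are not rows from prostate.csv.

```r
# Hypothetical toy values for illustration only (not taken from prostate.csv)
toy <- data.frame(
  CAPSULE = c(0, 1, 0, 1, 0),
  AGE     = c(65, 72, 70, 76, 69),
  PSA     = c(1.4, 6.7, 4.9, 51.2, 12.3)
)

# summary() on a data frame reports min, quartiles, median, mean and max per column
toy.summary <- summary(toy)
print(toy.summary)

# The same statistics for one column, computed explicitly
psa.stats <- c(quantile(toy$PSA), mean = mean(toy$PSA))
print(psa.stats)
```

The layout of the printed table is the one shown above for prostate.data: one block of six statistics per column.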

Running GLM (Generalized Linear Model)

Now that we have a sense of the data set's structure, we can run a statistical analysis on it. Let's fit a logistic regression, with CAPSULE as the response and AGE, RACE, PSA and GLEASON as the predictors. The GLM family in this case is binomial, with the default link function logit. (See Wiki for a more detailed mathematical explanation.) In essence, we are studying how the probability of capsular involvement is affected by a patient's age, race, PSA and Gleason score.

h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","GLEASON"), data = prostate.data, family = "binomial", nfolds = 10, alpha = 0.5)

Your result should include the following coefficients:

Coefficients:

AGE	RACE	PSA	GLEASON	Intercept
-0.02119	-0.46410	0.02804	1.07613	-5.86616

Degrees of Freedom: 379 Total (i.e. Null);  374 Residual
Null Deviance:     512.3
Residual Deviance: 416.3
AIC: 426.3

Looking at the coefficients, we see that the log-odds (and by extension, the probability) of prostate capsular penetration increase with PSA and Gleason score, as expected, but decrease slightly with age. The negative RACE coefficient indicates that a patient who is black is less likely to exhibit capsular involvement than one who is white, holding the other predictors fixed, although it is unknown whether this is a direct effect or whether race is capturing some other characteristic excluded from the regression.
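
To make the log-odds interpretation concrete, here is a small sketch in base R that turns the reported coefficients into odds ratios and a predicted probability. It assumes the five coefficients are listed in the order of the predictors in the h2o.glm call (AGE, RACE, PSA, GLEASON) followed by the intercept. The patient values are hypothetical.

```r
# Coefficients copied from the GLM output above.
# Assumption: they appear in predictor order, with the intercept last.
beta <- c(AGE = -0.02119, RACE = -0.46410, PSA = 0.02804,
          GLEASON = 1.07613, Intercept = -5.86616)

# exp(coefficient) is the multiplicative change in the odds of CAPSULE = 1
# for a one-unit increase in that predictor, other predictors held fixed
odds.ratios <- exp(beta[c("AGE", "RACE", "PSA", "GLEASON")])
print(round(odds.ratios, 3))

# Hypothetical patient, chosen only to illustrate the arithmetic
patient <- c(AGE = 65, RACE = 1, PSA = 10, GLEASON = 7)

# Linear predictor (log-odds), then the inverse logit to get a probability
log.odds <- beta[["Intercept"]] + sum(beta[names(patient)] * patient)
prob <- plogis(log.odds)
print(c(log.odds = log.odds, probability = prob))
```

An odds ratio above 1 (PSA, GLEASON) means the odds of capsular penetration rise with that predictor. A value below 1 (AGE, RACE) means they fall.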

Running K-Means Clustering

Now, let's run the k-means algorithm to group similar patients into clusters. (See Wiki for a description of the mathematics.) We start with k = 5 clusters, using only the columns AGE, RACE, GLEASON, CAPSULE and PSA:

prostate.km = h2o.kmeans(data = prostate.data, centers = 5, cols = c("AGE","RACE","GLEASON","CAPSULE","PSA"))
print(prostate.km)

K-means clustering with 5 clusters of sizes 278, 4, 23, 69, 6

Cluster means:

  AGE	RACE	GLEASON	CAPSULE	PSA
1 66.14947	1.071174	6.124555	0.3060498	7.107402
2 65.75000	1.250000	8.000000	1.0000000	131.175000
3 66.09091	1.227273	7.136364	0.7272727	55.213636
4 65.44776	1.089552	7.014925	0.6119403	23.876119
5 67.50000	1.166667	7.666667	1.0000000	86.500000

From our results, we see that 278 patients, the large majority, are in cluster 1, with a mean age close to 66 years and only about 30% exhibiting capsular penetration. This cluster has the lowest mean PSA and Gleason score. In contrast, cluster 2 is the smallest, with only 4 patients, but they all show capsular penetration and, as expected, have the highest mean Gleason score and PSA. The clusters differ mainly in PSA level, and the groups with higher PSA also show more capsular penetration.
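
As a local point of comparison, the sketch below runs base R's kmeans (from the stats package, not H2O) on simulated data. The two simulated groups and their parameters are invented for illustration. They are built so that the groups differ mainly in PSA, which is also the column that varies most across the cluster means reported above.

```r
set.seed(42)

# Simulated patients (hypothetical): similar ages, very different PSA levels
low  <- data.frame(AGE = rnorm(50, mean = 66, sd = 5),
                   PSA = rnorm(50, mean = 7,  sd = 2))
high <- data.frame(AGE = rnorm(50, mean = 66, sd = 5),
                   PSA = rnorm(50, mean = 90, sd = 10))
patients <- rbind(low, high)

# Base R k-means with several random starts to avoid a poor local optimum
fit <- kmeans(patients, centers = 2, nstart = 10)

# Cluster sizes and per-cluster column means, analogous to the output above
print(fit$size)
print(fit$centers)
```

Because k-means relies on Euclidean distance, a wide-ranging column such as PSA can dominate the grouping. Standardizing the columns first, for example with scale(patients), is one common option if that is not the intent.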
While this tutorial used a relatively small data set, H2O gives you the ability to manipulate data sets too large for conventional R to handle. With the H2O R package installed, you can treat such a data set like any other in R, and H2O will do the heavy lifting in the background for you. Try it out yourself!

References

R Project for Statistical Computing http://www.r-project.org/
H2O Official Documentation http://docs.0xdata.com/
H2O R Quick-Start Guide http://docs.0xdata.com/Ruser/Rinstall.html
H2O R Package Documentation http://docs.0xdata.com/bits/h2o_package.pdf


H2O.ai Team

