August 13th, 2013

Run H2O From Within R


With the REST API, it’s simple to run H2O operations from within R using syntax similar to that of your favorite R functions. In this post, we’ll walk through a simple demo of its capabilities. First, get H2O installed and running by following the installation tutorial. Once you have the R package loaded, you can list the included demos by typing demo(package = "h2o"), and run one of them by typing, e.g., demo(h2o.glm). We’ll be stepping through a few of the basic statistical functions in this tutorial.

Starting up H2O in R

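library(h2o)  # load the H2O R package
# Describe where the running H2O instance lives (local defaults shown)
localH2O = new("H2OClient", ip = "127.0.0.1", port = 54321)
# Verify that R can reach H2O at that address
h2o.checkClient(localH2O)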

The beginning of each H2O R script looks the same: first, load the R package, then create an H2OClient object containing the IP address and port at which H2O resides. If you are running H2O on your local machine, the defaults are IP = 127.0.0.1 and port = 54321. You can call h2o.checkClient to check whether H2O is reachable. Once that’s done, we are ready to work with some data!
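If h2o.checkClient fails, it can help to confirm that anything is listening at that address at all. The following is a minimal sketch in base R, not part of the h2o package: the helper name h2o_port_reachable is hypothetical, and it only tests whether a TCP connection opens, not whether the listener is actually H2O.

```r
# Hypothetical helper (not part of the h2o package): returns TRUE if a TCP
# connection to host:port can be opened within `timeout` seconds.
h2o_port_reachable <- function(host = "127.0.0.1", port = 54321, timeout = 2) {
  tryCatch({
    con <- socketConnection(host = host, port = port, open = "r+",
                            blocking = TRUE, timeout = timeout)
    close(con)  # we only wanted to know that the connection opens
    TRUE
  },
  warning = function(w) FALSE,  # R warns when the port cannot be opened
  error = function(e) FALSE)
}

# Check the default local address used in this post.
if (!h2o_port_reachable()) {
  message("Nothing is listening on 127.0.0.1:54321; start H2O first.")
}
```

A FALSE result points to H2O not running (or running on a different port) rather than to a problem in the R package.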

Importing and Summarizing Data

In this tutorial, we will be working with the prostate cancer data set, which comes from a study by Dr. Donn Young at The Ohio State University Comprehensive Cancer Center of patients with varying degrees of prostate cancer. The relevant columns are CAPSULE (binary variable indicating tumor penetration of the prostatic capsule), AGE (in years), RACE (1 = white, 2 = black), PSA (prostate-specific antigen value), and GLEASON (total Gleason score, indicating how aggressive the cancer is). See Applied Logistic Regression by Hosmer and Lemeshow (2000) for more details.

prostate.data = h2o.importURL(localH2O, path = "https://raw.github.com/0xdata/h2o/master/smalldata/logreg/prostate.csv", key = "prostate.hex")
summary(prostate.data)

The first line imports and parses the data set prostate.csv from the given URL, storing it in H2O under a unique identifier (hex key), prostate.hex. The method h2o.importURL returns an H2OParsedData object containing the IP and port on which H2O resides, as well as the data set’s hex key, which we save to prostate.data. All references within R to this data set will now go through prostate.data. Hence, to get summary statistics, we call summary(prostate.data). This displays the minimum, maximum, median, mean, and quartiles of each column of the data set, just like in R (only relevant columns are shown below):

 CAPSULE         AGE              RACE            PSA                GLEASON
 Min.   :0.000   Min.   :43.000   Min.   :0.000   Min.   :  0.300   Min.   :0.000
 1st Qu.:0.000   1st Qu.:62.000   1st Qu.:1.000   1st Qu.:  5.132   1st Qu.:6.000
 Median :0.000   Median :67.000   Median :1.000   Median :  5.132   Median :6.000
 Mean   :0.403   Mean   :66.039   Mean   :1.087   Mean   : 15.409   Mean   :6.384
 3rd Qu.:1.000   3rd Qu.:71.000   3rd Qu.:1.000   3rd Qu.: 14.795   3rd Qu.:7.000
 Max.   :1.000   Max.   :79.000   Max.   :2.000   Max.   :139.700   Max.   :9.000

Running GLM (Generalized Linear Model)

Now that we have a sense of the data set’s structure, we can run a statistical analysis on it. Let’s fit a logistic regression with CAPSULE as the response and AGE, RACE, PSA, and GLEASON as the predictors. The GLM family in this case is binomial, with the default logit link function. (See Wiki for a more detailed mathematical explanation.) In essence, we are studying how the probability of capsular involvement is affected by a patient’s age, race, PSA, and Gleason score.

h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","GLEASON"), data = prostate.data, family = "binomial", nfolds = 10, alpha = 0.5)

You should get as your result the following coefficients:

Coefficients:
      AGE      RACE       PSA   GLEASON Intercept
 -0.02119  -0.46410   0.02804   1.07613  -5.86616

Degrees of Freedom: 379 Total (i.e. Null);  374 Residual
Null Deviance:     512.3
Residual Deviance: 416.3
AIC: 426.3

Looking at the coefficients, we see that the log-odds (and by extension, the probability) of prostate capsular penetration increase with PSA and Gleason score, as expected, but decrease slightly with age. The negative RACE coefficient indicates that, holding the other predictors fixed, a patient who is black has lower estimated odds of capsular involvement than one who is white, although it is unknown whether this is a direct effect or whether race is capturing some other characteristic excluded from the regression.
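To make the log-odds interpretation concrete, here is a minimal sketch in base R that turns the coefficients printed above into odds ratios and into a predicted probability. The patient values are hypothetical, chosen purely for illustration, and are not a row from the data set. Keep in mind that the model was fit with alpha = 0.5, so the coefficients may be shrunk by regularization.

```r
# Coefficients copied from the h2o.glm output above.
coefs <- c(Intercept = -5.86616, AGE = -0.02119, RACE = -0.46410,
           PSA = 0.02804, GLEASON = 1.07613)

# Odds ratios: the multiplicative change in the odds of CAPSULE = 1
# for a one-unit increase in each predictor, others held fixed.
odds_ratios <- exp(coefs[c("AGE", "RACE", "PSA", "GLEASON")])
print(round(odds_ratios, 3))

# Hypothetical patient (illustrative values, not taken from the data).
patient <- c(AGE = 65, RACE = 1, PSA = 10, GLEASON = 7)

# Linear predictor (log-odds), then the inverse logit to get a probability.
log_odds <- coefs[["Intercept"]] + sum(coefs[names(patient)] * patient)
prob <- plogis(log_odds)
print(prob)
```

An odds ratio above 1 (PSA, GLEASON) means the predictor raises the odds of capsular penetration, while one below 1 (AGE, RACE) means it lowers them.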

Running K-Means Clustering

Now, let’s run the k-means algorithm to group similar patients into clusters. (See Wiki for a description of the mathematics.) We start with k = 5 clusters, using only the columns AGE, RACE, GLEASON, CAPSULE, and PSA:

prostate.km = h2o.kmeans(data = prostate.data, centers = 5, cols = c("AGE","RACE","GLEASON","CAPSULE","PSA"))
print(prostate.km)

K-means clustering with 5 clusters of sizes 278, 4, 23, 69, 6

Cluster means:
       AGE     RACE  GLEASON   CAPSULE        PSA
1 66.14947 1.071174 6.124555 0.3060498   7.107402
2 65.75000 1.250000 8.000000 1.0000000 131.175000
3 66.09091 1.227273 7.136364 0.7272727  55.213636
4 65.44776 1.089552 7.014925 0.6119403  23.876119
5 67.50000 1.166667 7.666667 1.0000000  86.500000

From our results, we see that 278 patients, the large majority, are in cluster 1, with a mean age close to 66 years and only about 30% exhibiting capsular penetration. The mean PSA and Gleason score of this cluster are the lowest of the five, PSA by a wide margin. In contrast, cluster 2 is the smallest, with only 4 patients, but they all show capsular penetration and, as expected, have far higher Gleason scores and PSA. The clusters therefore separate patients along clinically meaningful lines, with milder cases grouped apart from more severe ones.
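One way to see which variables drive this separation is to compare how far apart the cluster centers lie along each column. The sketch below uses only the cluster means printed above, in base R. Whether h2o.kmeans standardized the columns before clustering is not stated in this post, so treat the reading as suggestive.

```r
# Cluster means copied from the h2o.kmeans output above.
centers <- matrix(
  c(66.14947, 1.071174, 6.124555, 0.3060498,   7.107402,
    65.75000, 1.250000, 8.000000, 1.0000000, 131.175000,
    66.09091, 1.227273, 7.136364, 0.7272727,  55.213636,
    65.44776, 1.089552, 7.014925, 0.6119403,  23.876119,
    67.50000, 1.166667, 7.666667, 1.0000000,  86.500000),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5),
                  c("AGE", "RACE", "GLEASON", "CAPSULE", "PSA"))
)

# Spread (max minus min) of each variable across the five centers.
spread <- apply(centers, 2, function(col) diff(range(col)))
print(round(spread, 3))

# Cluster labels ordered from lowest to highest mean PSA.
psa_order <- order(centers[, "PSA"])
print(psa_order)
```

The centers differ far more in PSA than in any other column, which suggests the clusters are essentially PSA bands. If that is not the grouping you want, rescaling the columns before clustering is worth considering.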

While this tutorial used a relatively small data set, H2O lets you manipulate data sets too large for conventional R to handle. With the H2O R package installed, you can treat them like any other data set in R, and H2O will do the heavy lifting in the background for you. Try it out yourself!
