June 10th, 2015

Scaling R with H2O

With the release of H2O 3.0, it seems like a good time to reintroduce the R API for H2O and help users better understand the differences between R data frames and H2OFrames. Some of the first questions we typically get include:

  • Does H2O support all R packages and functions?
  • Is H2OFrame an extension of data.frame?
  • Are H2O supported algorithms written on top of preexisting packages in R like glmnet?

Reading in Data

The S4 object H2OFrame is a tabular representation of data that has been imported and parsed into H2O’s Distributed Key-Value Store (DKV). The object holds information about where the H2O cluster is (conn), what the frame is called on the cluster (frame_id), and the first 10 rows, which are used when the frame is printed. When you use the H2OFrame object in a supported function, R simply acts as a glue language: the R code you write makes a call to the Java backend, where the expression(s) are actually computed.
For example, the user can execute an h2o.importFile command to access a remote H2O cluster and specify the path to import data from. This command is sent over H2O’s REST API to an endpoint on the Java side. Once the import is completed, H2O returns a JSON response that is summarized into the components described earlier. The most important difference, then, is that an R data frame sits in memory in R, while an H2OFrame is just a reference to an object in the DKV.
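To make the "reference versus in-memory" distinction concrete, here is a minimal sketch. It assumes the h2o package is installed and that a local cluster can be started. The built-in iris data set and the frame name "iris.hex" are illustrative choices, not part of the original example.

```r
library(h2o)
h2o.init(nthreads = -1)            # start or connect to a local H2O cluster

# An ordinary R data.frame: all of its values live in R's memory.
iris.df <- iris
print(object.size(iris.df))

# Upload the data to the cluster. What comes back is a handle to a frame
# stored in H2O's DKV, not a second in-R copy of the data.
iris.hex <- as.h2o(iris.df, destination_frame = "iris.hex")  # name is an assumption

print(class(iris.hex))             # the handle's class, not "data.frame"
print(dim(iris.hex))               # dimensions are reported by the cluster
print(h2o.ls())                    # keys currently held in the DKV
```

Pulling the data back into R with as.data.frame(iris.hex) makes a real in-memory copy, so it is usually reserved for small results.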

Summarizing New Frame

Similarly, once the user has a data frame and wants to summarize it, they can execute the h2o.summary or summary function. The command makes a call to H2O to execute a MapReduce (MR) task, which returns a JSON response that is parsed into a table object in R. The input and actual execution of H2O’s summary function are different from those of the base summary function, but the output is still a table object like the one the generic summary returns.
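As a rough illustration, the same summary call can be made on a data.frame and on an H2OFrame. This sketch assumes a local cluster is available. The mtcars data set and the frame name "cars.hex" are stand-ins chosen for the example.

```r
library(h2o)
h2o.init(nthreads = -1)

cars.df  <- mtcars
cars.hex <- as.h2o(cars.df, destination_frame = "cars.hex")  # name is an assumption

# Base R: computed entirely inside the R process.
base.summary <- summary(cars.df)

# H2O: the same generic, but the statistics are computed on the cluster
# and only the summarized result is returned to R.
hex.summary <- summary(cars.hex)

print(class(base.summary))
print(class(hex.summary))
```

The numbers in the two summaries can differ slightly, because the two functions are separate implementations.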

Supported R functions

For a list of all the functions you can apply to an H2OFrame, bring up the package documentation in the R console by executing ?h2o, which lists all supported functions.
In short, all H2O functions are prefixed with h2o so that the user understands that, although there are similarities between H2O’s syntax and base R’s syntax, they are essentially different functions.
All of H2O’s algorithms are executed in memory as Java tasks on the H2O cluster, so no modeling work is done in R. When you call glmnet, glm, or h2o.glm, you are accessing three different implementations of GLM. There are, however, functions that have been overloaded as methods for H2OFrames, such as h2o.summary, which is also accessible as summary. Unary and binary operations on the frame do not have the h2o prefix either; you can access a list of these supported operators by executing ?H2OFrame-class in the R console.
Below are some examples of parity between base R functions and H2O-specific R functions. The package was written so that, for the most frequently used R functions, there is an equivalent R operation or expression that can be sent to the Java side to be computed.
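The following sketch shows what this parity can look like in practice: the same operators applied to a data.frame and to an H2OFrame, followed by base glm next to h2o.glm. It assumes a running local cluster. The toy data, the mtcars columns, and the Gaussian family are illustrative assumptions only.

```r
library(h2o)
h2o.init(nthreads = -1)

# Toy data, made up for this illustration.
x.df  <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
x.hex <- as.h2o(x.df)

# Binary operators look identical, but the second sum is evaluated on the
# cluster and yields another H2OFrame rather than a numeric vector.
sum.df  <- x.df$a + x.df$b
sum.hex <- x.hex$a + x.hex$b
print(sum.df)
print(as.data.frame(sum.hex))      # pull the small result back into R

# Two different GLM implementations fit to the same data.
cars.hex <- as.h2o(mtcars)
fit.r    <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian)
fit.h2o  <- h2o.glm(x = c("wt", "hp"), y = "mpg",
                    training_frame = cars.hex, family = "gaussian")
print(coef(fit.r))
print(fit.h2o)
```

The two fits are not expected to match exactly, since h2o.glm has its own solver and its own default settings.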

Examples of Transformation of Data Frame

For an example of how you might perform some simple transformations on your H2OFrame, please download a small airlines dataset and run through the following example.
To start, import the data and run the summary function.

## Load library and initialize h2o
library(h2o)
conn <- h2o.init(nthreads = -1)
## Import the airlines data into H2O
pathToAirlines <- normalizePath("~/Downloads/allyears2k.csv")
airlines.hex <- h2o.importFile(conn = conn, path = pathToAirlines, destination_frame = "airlines.hex")
## Summary stats
summary(airlines.hex)

Next, we want to create a feature indicating how long each trip took. The trip duration is the difference between ArrTime and DepTime, both of which were parsed as numerics in HHMM form. For each one, we’ll extract the hour and minute and convert them to total time (in minutes) elapsed since 12:00 AM. Finally, we take the difference between arrival and departure time and append it to the airlines frame. The beauty of the h2o package is how easy it is to translate R code into H2O+R code. The following feature-creation commands are exactly the commands you would run on an R data.frame that was read in using read.csv.

## Create trip_duration feature
hour1 <- airlines.hex$ArrTime %/% 100
mins1 <- airlines.hex$ArrTime %% 100
arrTime <- hour1*60+mins1
hour2 <- airlines.hex$DepTime %/% 100
mins2 <- airlines.hex$DepTime %% 100
depTime <- hour2*60+mins2
## Take the difference between the two times and assign it to a new feature in frame
airlines.hex$trip_duration <- arrTime - depTime
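
To check the arithmetic without a cluster, the same steps can be run on a plain R data.frame. This is a minimal sketch: the three flights are made-up values in HHMM form, and to_minutes is a hypothetical helper introduced only for this illustration. The duration is computed here as arrival minus departure.

```r
# Hypothetical three-flight sample; times are HHMM numbers, as in the airlines data.
flights.df <- data.frame(DepTime = c(730, 1215, 1805),
                         ArrTime = c(915, 1440, 2010))

# Hypothetical helper: convert HHMM to minutes elapsed since 12:00 AM.
to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + (hhmm %% 100)

# Arrival minus departure gives the trip duration in minutes.
flights.df$trip_duration <- to_minutes(flights.df$ArrTime) - to_minutes(flights.df$DepTime)
print(flights.df)
```

For these rows the durations come out to 105, 145, and 125 minutes. A flight that lands after midnight would produce a negative value under this simple subtraction and would need separate handling.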
