With the advent of H2O 3.0, it seems timely to reintroduce the R API for H2O and to help users better understand the differences between R data frames and H2OFrames. Some of the first questions we typically get include:
H2OFrame an extension of
An H2OFrame is a tabular representation of data that has been imported and parsed into H2O’s Distributed Key-Value Store (DKV). The object holds information about where the H2O cluster is (conn), what the frame is called on the cluster (frame_id), and the first 10 rows, which are used when the frame is printed. When you use the H2OFrame object in a supported function, R simply acts as a glue language: the R code you write really makes a call to the Java backend, where the expression(s) are computed.
For example, you can execute an h2o.importFile command to access a remote H2O cluster and specify the path to import data from. The command is sent over H2O’s REST API to an endpoint on the Java side. Once the import completes, H2O returns a JSON response that is summarized into the components described earlier. The most important difference, then, is that an R data frame sits in memory in R, while an H2OFrame is just a reference to an object in the DKV.
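The difference between the two kinds of object can be sketched as follows. This is only an illustration, assuming a local cluster can be started with h2o.init and using R's built-in iris data rather than an imported file; the key name "iris.hex" is an arbitrary choice.

```r
library(h2o)
h2o.init(nthreads = -1)            # start or connect to a local H2O cluster

# iris is an ordinary data.frame held in R's own memory
is.data.frame(iris)

# as.h2o() uploads the data to the cluster and returns a handle to it
iris.hex <- as.h2o(iris, destination_frame = "iris.hex")
is.data.frame(iris.hex)            # the handle is an H2OFrame, not a data.frame

# the data itself is stored under its key in the cluster
h2o.ls()

# as.data.frame() pulls the rows back into R's memory
iris.local <- as.data.frame(iris.hex)
dim(iris.local)
```

Moving data between R and the cluster in this way copies it, so on large frames it is usually preferable to keep the work on the H2O side and bring back only small results.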
Similarly, once you have an H2OFrame and want to summarize it, execute the summary function. The command makes a call to H2O to execute a map-reduce (MR) task, which returns a JSON response that is parsed into a table object in R. The input and the actual execution of H2O’s summary function differ from those of the base summary function, but the output is still a table object like the one the generic summary returns.
For a list of all the functions you can apply to an H2OFrame, bring up the package documentation in the R console by executing ?h2o, which lists all supported functions.
In short, all H2O functions are prefixed with h2o so that users understand that, although there are similarities between H2O’s syntax and base R’s syntax, they are essentially different functions.
All of H2O’s algorithms are executed in memory as Java tasks, so no work is done in R. When you call h2o.glm, you are accessing a different implementation of GLM from base R's glm. There are, however, functions that have been overloaded as methods for H2OFrames, such as h2o.summary, which is also accessible as summary. Unary and binary operations on the frame do not carry the h2o prefix either, and you can see a list of the supported operators by executing ?"H2OFrame-class" in the R console (the quotes are needed because the topic name contains a hyphen).
Below are some examples of parity between base R functions and H2O-specific R functions. The package was written so that, for the most frequently used R functions, there is an equivalent R operation or expression that can be sent to the Java side to be computed.
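A few such pairs can be sketched on R's built-in iris data. This is an illustrative selection rather than a complete list, and it assumes a local cluster started with h2o.init; the object names are arbitrary.

```r
library(h2o)
h2o.init(nthreads = -1)
iris.hex <- as.h2o(iris)           # copy of iris held on the H2O cluster

# Dimensions: the same calls work on both kinds of frame
dim(iris)
dim(iris.hex)

# Arithmetic and comparison operators keep their base R spelling,
# but on an H2OFrame they are evaluated by the cluster
ratio.local <- iris$Sepal.Length / iris$Sepal.Width
ratio.hex   <- iris.hex$Sepal.Length / iris.hex$Sepal.Width

# Row filtering with a logical condition
long.local <- iris[iris$Sepal.Length > 5, ]
long.hex   <- iris.hex[iris.hex$Sepal.Length > 5, ]

# summary() dispatches to the H2O method; h2o.summary() is the prefixed form
summary(iris)
summary(iris.hex)
h2o.summary(iris.hex)
```

In each pair the first line runs entirely in R, while the second only sends an expression to the cluster and returns a new H2OFrame handle or a small summarized result.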
For an example of how you might perform some simple transformations on your H2OFrame, please download the small airlines data set and run through the following example:
To start, import the data and run the summary function.
## Load library and initialize h2o
library(h2o)
conn <- h2o.init(nthreads = -1)
## Import the airlines data (adjust the path to wherever the file was saved)
pathToAirlines <- normalizePath("~/Downloads/allyears2k.csv")
airlines.hex <- h2o.importFile(conn = conn, path = pathToAirlines, destination_frame = "airlines.hex")
## Summary stats, histogram plots
summary(airlines.hex)
Next we want to create a feature indicating how long each trip took. To calculate the trip duration, take the difference between DepTime and ArrTime, which were parsed as numerics in hhmm form. For both, we’ll extract the hour and minute and convert them to total time (in minutes) elapsed since 12:00 AM. Finally, take the difference between arrival and departure time and append it to the airlines frame. The beauty of the h2o package is how easy it is to translate R code into H2O+R code. The following feature-creation commands are exactly the ones you would run on an R data.frame that had been read into R.
## Create trip_duration feature
hour1 <- airlines.hex$ArrTime %/% 100
mins1 <- airlines.hex$ArrTime %% 100
arrTime <- hour1*60+mins1
hour2 <- airlines.hex$DepTime %/% 100
mins2 <- airlines.hex$DepTime %% 100
depTime <- hour2*60+mins2
## Take the difference between arrival and departure time and assign it to a new feature in the frame
airlines.hex$trip_duration <- arrTime - depTime
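To see the arithmetic on concrete numbers, the same steps can be run on a small base R data.frame. The three rows below are invented for illustration and are not taken from the airlines file; to_minutes is a hypothetical helper name.

```r
# Hypothetical stand-in for the airlines data: times stored as hhmm numerics
flights <- data.frame(DepTime = c(930, 1215, 5),
                      ArrTime = c(1045, 1400, 130))

# Convert an hhmm value to minutes elapsed since 12:00 AM
to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + (hhmm %% 100)

# Arrival minus departure gives the trip duration in minutes
flights$trip_duration <- to_minutes(flights$ArrTime) - to_minutes(flights$DepTime)
flights$trip_duration
# 75 105 85
```

Note that a flight landing after midnight would come out negative under this simple subtraction, so such rows would need separate handling (for example, adding 24 * 60 minutes).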