June 10th, 2015

Scaling R with H2O


With the advent of H2O 3.0, it seems an appropriate time to reintroduce the R API for H2O and help users better understand the differences between R data frames and H2OFrames. Some of the first questions we typically get include:

  • Does H2O support all R packages and functions?
  • Is H2OFrame an extension of data.frame?
  • Are H2O supported algorithms written on top of preexisting packages in R like glmnet?

Reading in Data

The S4 object H2OFrame is a tabular representation of data that has been imported and parsed into H2O’s Distributed Key-Value Store (DKV). The object holds information about where the H2O cluster is (conn), what the frame is called on the cluster (frame_id), and the first 10 rows, which are shown when the frame is printed. When you use an H2OFrame object in a supported function, R simply acts as a glue language: the R code you write makes a call to the Java backend, where the expression is actually computed.
For example, the user can execute an h2o.importFile command to access a remote H2O cluster and specify the path to import data from. This command is sent over H2O’s REST API to an endpoint on the Java side. Once the import is completed, H2O returns a JSON response that is summarized into the components described earlier. The most important difference is that an R data frame sits in memory in R, while an H2OFrame is just a reference to an object in the DKV.
parse_step1
parse_step2
parse_step3
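
As a rough illustration of the reference-versus-in-memory distinction, the sketch below pushes R's built-in iris data set into H2O and pulls it back again. It assumes a local cluster started with h2o.init(); the frame name "iris.hex" is arbitrary and chosen only for this example.

```r
library(h2o)
h2o.init()  # assumption: starts or connects to a local H2O cluster

# Push a local data.frame into the cluster; the R object is only a handle
iris.hex <- as.h2o(iris, destination_frame = "iris.hex")
class(iris.hex)   # an H2OFrame, not a data.frame
h2o.ls()          # lists the keys currently held in the DKV

# Pull the data back into R's memory as an ordinary data.frame
iris.local <- as.data.frame(iris.hex)
class(iris.local)
```

Only the last step copies the full data into R; everything before it works with a reference to the frame held by the cluster.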

Summarizing New Frame

Similarly, once the user has a data frame and wants to summarize it, they can execute the h2o.summary or summary function. The command makes a call to H2O to execute an MR task, which returns a JSON response that is parsed into a table object in R. The input and actual execution of H2O’s summary function differ from those of the base summary function, but the output is still a table object, the same type the generic summary returns.
summary_step1
summary_step2

Supported R functions

For a list of all the functions you can apply to an H2OFrame, bring up the package documentation in the R console by executing ?h2o.
In short, all H2O functions are prefixed with h2o so that the user understands that, despite the similarities between H2O’s syntax and base R’s syntax, they are different functions.
All of H2O’s algorithms are executed in memory as Java tasks, so no work is done in R. When the user calls glmnet, glm, or h2o.glm, each one accesses a different implementation of GLM. There are, however, functions that have been overloaded as methods for H2OFrames, such as h2o.summary, which is also accessible as summary. Unary and binary operators on the frame do not carry the h2o prefix either; you can see the list of supported operators by executing ?H2OFrame-class in the R console.
Below are some examples of parity between base R functions and H2O-specific R functions. The package was written so that, for the most frequently used R functions, there is an equivalent operation or expression that can be sent to the Java side to be computed.
R_H2O_Parity
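
To make the overloading concrete, here is a minimal sketch that applies a few base R generics and operators to an H2OFrame. It assumes a running local cluster and uses the built-in iris data under the arbitrary frame name "iris.hex"; the variable names are illustrative only.

```r
library(h2o)
h2o.init()  # assumption: a local H2O cluster is available

iris.hex <- as.h2o(iris, destination_frame = "iris.hex")

# Generics without the h2o prefix dispatch to H2OFrame methods
dim(iris.hex)
summary(iris.hex)

# Binary operators are also evaluated on the cluster and return a new frame
ratio.hex <- iris.hex$Sepal.Length / iris.hex$Sepal.Width
class(ratio.hex)
```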

Examples of Transformation of Data Frame

For an example of how you might perform some simple transformations on your H2OFrame, please download the small airlines data set and run through the following example.
To start, import the data and run the summary function.

## Load library and initialize h2o
library(h2o)
conn <- h2o.init(nthreads = -1)
pathToAirlines <- normalizePath("~/Downloads/allyears2k.csv")
airlines.hex <- h2o.importFile(conn = conn, path = pathToAirlines, destination_frame = "airlines.hex")
## Summary stats, histogram plots
summary(airlines.hex)

Next we want to create a feature indicating how long each trip took. DepTime and ArrTime were parsed as numerics in HHMM form, so they cannot simply be subtracted from one another. For both columns we extract the hour and minute and convert them to total minutes elapsed since 12:00 AM. Finally, we take the difference between arrival and departure time and append it to the airlines frame. The beauty of the h2o package is how easy it is to translate R code into H2O+R code: the commands below are exactly what you would run on an R data.frame that had been read in with read.csv.

## Create trip_duration feature
hour1 <- airlines.hex$ArrTime %/% 100
mins1 <- airlines.hex$ArrTime %% 100
arrTime <- hour1*60+mins1
hour2 <- airlines.hex$DepTime %/% 100
mins2 <- airlines.hex$DepTime %% 100
depTime <- hour2*60+mins2
## Take the difference between arrival and departure times and assign it to a new feature in the frame
airlines.hex$trip_duration <- arrTime - depTime
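
The same arithmetic can be checked on a plain R data.frame without a cluster. The sketch below uses three made-up flights and a hypothetical helper, to_minutes, neither of which comes from the airlines data; it computes arrival minus departure so that the duration comes out positive.

```r
# Hypothetical sample in the same HHMM numeric form as DepTime and ArrTime
flights <- data.frame(DepTime = c(930, 1415, 2350),
                      ArrTime = c(1045, 1600, 2359))

# Convert an HHMM number to minutes elapsed since midnight
to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

flights$trip_duration <- to_minutes(flights$ArrTime) - to_minutes(flights$DepTime)
print(flights$trip_duration)  # 75 105 9
```

Note that a flight landing after midnight would produce a negative value with this simple difference, so such rows would need separate handling.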
