March 24th, 2014

Data Munging in H2O+R

Over the weekend we fielded a question from one of our users about the basics of data munging in H2O through R. It was a good question, so I wanted to share the response with a wider audience, namely you guys.

There are a few quick things to know about data munging in H2O+R:
– It often looks and feels like you are manipulating data in R; we designed it to work that way. However, for all of the trappings of old R, once you've passed data to H2O, all of your data munging takes place on the H2O cluster, and the information you see is passed to R through JSON.
– This means that you aren't limited by the ceiling on R's ability to handle data; you are now limited by the total amount of memory you initially allocated to your cluster.
– It also means that some commands should be undertaken with care. For example, it's now possible to manipulate datasets with thousands of factor levels, so asking H2O+R to return a table displaying information from high-cardinality factors can produce results of overwhelming volume.
– If it's necessary to pass data back and forth between H2O and R (i.e., if you want to manipulate data in the R environment, and not on the H2O server), the R calls str(), as.data.frame() and as.h2o() are your new best friends*.
In this context, as.data.frame(h2o data set) turns data into a data frame in the R environment. Use it wisely: it's entirely possible to ask H2O to pass R many millions of observations and quickly exceed R's capacity, turning your R session into a paperweight. If you must take data from H2O into R, I highly recommend taking only the data you absolutely need. Pass in a single column or two, but avoid moving the whole large data set if you can.
On the other hand, as.h2o(R data set) takes data from R and passes it to the server. Here the limitation is the amount of memory allocated to H2O, and frankly, R will top out long before H2O does, so if you can work with data in R, you can surely pass it to your cluster.
To double-check that your data were communicated correctly, use str(data set). It's a good check after passing information back and forth: it confirms that factors are still being treated as factors, that NAs have been handled appropriately, and that all of the other little details that matter for the analysis are in place.
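The round trip described above can be sketched as follows. This is a minimal illustration, not the attached script: it assumes a recent release of the h2o R package (where as.h2o() accepts a data frame directly; older releases may need a different call), and the data frame and its column names are made up for the example.

```r
library(h2o)

# Assumption: a local H2O instance is acceptable. h2o.init() connects to a
# running instance or starts one; its memory is fixed at startup.
h2o.init()

# A small, made-up R data frame with a factor column and a missing value.
local_df <- data.frame(
  region = factor(c("north", "south", "north", "east")),
  births = c(3, NA, 5, 2)
)

# R -> H2O: the data now lives on the cluster; `remote` is a handle to it.
remote <- as.h2o(local_df)

# Inspect the structure of the H2O copy (column types, factor levels).
str(remote)

# H2O -> R: pull back only the column you need, not the whole frame.
births_only <- as.data.frame(remote[, "births"])

# Confirm that the type and the NA survived the round trip.
str(births_only)
```

Selecting the column on the H2O side before calling as.data.frame() is the point of the exercise: only that one column is transferred into R's memory.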
*There are examples in the attached R script, as well as the data set for those examples. It's easy enough to cut and paste or command+enter through the R file, but you will need to (1) start your own instance of H2O (or at least have one available) before getting to work, and (2) specify your own file path in line 7. The path that is included is a placeholder, and it will only match your path by accident.
Get the data: https://raw.githubusercontent.com/0xdata/h2o/master/smalldata/cebexpanded.csv
Get the R script: https://github.com/0xdata/h2o/blob/master/R/examples/QuickExampleMar24.R
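As a rough sketch of the caution about high-cardinality factors, the helper below imports a CSV and tabulates only the factor columns with a manageable number of levels. The function name summarize_small_factors and the max_levels threshold are hypothetical and not part of the h2o package; the sketch assumes a recent h2o release and an H2O instance that can read the given path.

```r
library(h2o)

# Hypothetical helper (not part of the h2o package): import a CSV into H2O and
# tabulate only those factor columns whose level count is small enough to read.
summarize_small_factors <- function(path, max_levels = 50) {
  frame <- h2o.importFile(path = path)
  tables <- list()
  for (name in colnames(frame)) {
    column <- frame[, name]
    # Check the level count on the cluster before asking for a table.
    if (isTRUE(is.factor(column)) && h2o.nlevels(column) <= max_levels) {
      # Only the small count table crosses from the cluster into R.
      tables[[name]] <- as.data.frame(h2o.table(column))
    }
  }
  tables
}
```

After starting H2O with h2o.init(), calling summarize_small_factors() with your own path to the downloaded cebexpanded.csv returns a named list of small count tables. Columns with more than max_levels levels are skipped rather than printed.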