The H2O team spent most of the useR! Aalborg 2015 conference at the booth, giving demos and discussing H2O. Amy had a 16-node EC2 cluster running with 8 cores per node, for a total of 128 cores. The demo consisted of loading large files in parallel and then running our distributed machine learning algorithms on them.
At an R conference, most people wanted to script H2O from R, which is of course built in (as is Python), but we also conveyed the benefits that our user interface, Flow, can provide in this space (even for programmers) by automating and accelerating common tasks. We enjoyed discussing future directions with attendees and bouncing ideas off of them. There is nothing like seeing people's first reaction to the product, live and in person! As an open source platform, H2O thrives on suggestions and contributions from our community.
All components of H2O are developed in the open on GitHub.
H2O contributed 3 talks:
Matt Dowle presented the details and benchmarks of the fast and stable radix sort implementation in data.table:::forderv. On 500 million random numerics (4 GB), base R takes approximately 22 minutes versus about 2 minutes for forder. He discussed the pros and cons of most-significant-digit (forwards) and least-significant-digit (backwards) radix sorting, as well as its application to all column types: integer with large range (>1e5), numeric, and character. We hope to find a sponsor from the R core team to help us include this method in base R, where it could benefit the community automatically. The work builds on articles by Terdiman (2000) and Herf (2001) and is joint work with Arun Srinivasan.
Slides: Fast, stable and scalable true radix sorting with Matt Dowle at useR! Aalborg
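For readers who want to try this at home, here is a minimal sketch of the kind of benchmark Matt described. It assumes a recent data.table is installed and plenty of RAM (the 500 million-element vector alone is about 4 GB); note that forderv is an internal data.table function, so its interface may change, and the timings in the comments are the ones quoted in the talk, not guarantees for your machine.

```r
# Minimal sketch of the radix-sort benchmark described in the talk.
# data.table:::forderv is internal; its interface may change between releases.
library(data.table)

set.seed(1)
n <- 5e8                  # 500 million random numerics (~4 GB of RAM for the vector alone)
x <- runif(n)

system.time(o_base  <- base::order(x))            # base R ordering (~22 minutes in the talk)
system.time(o_radix <- data.table:::forderv(x))   # data.table radix ordering (~2 minutes)

identical(o_base, o_radix)                        # both return the same integer ordering
```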
Erin presented an overview of scalable ensemble learning in R using the h2oEnsemble R package. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity or training time. This R interface provides easy access to scalable ensemble learning with H2O. The H2O Ensemble software implements the Super Learner, or stacking, ensemble algorithm using distributed base learning algorithms from the open source machine learning platform H2O. The following base learner algorithms are currently supported in h2oEnsemble: generalized linear models with elastic net regularization, Gradient Boosting (GBM) with regression and classification trees, Random Forest, and Deep Learning (multi-layer feed-forward neural networks). Erin provided code examples and some simple benchmarks.
Slides: h2oEnsemble with Erin LeDell at useR! Aalborg
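As a flavor of what this looks like from R, here is a minimal sketch along the lines of the examples in the h2oEnsemble README. The file path and column name are placeholders for whatever binary-classification dataset you have on hand, and the parameters are illustrative rather than tuned.

```r
# Minimal h2oEnsemble sketch; "train.csv" and the "response" column are placeholders.
library(h2o)
library(h2oEnsemble)

h2o.init(nthreads = -1)                       # start a local H2O cluster on all cores

train <- h2o.importFile("train.csv")          # any binary-response dataset
y <- "response"
x <- setdiff(names(train), y)
train[, y] <- as.factor(train[, y])           # ensure H2O treats this as classification

# Base learners are the wrapper functions shipped with h2oEnsemble;
# the metalearner combines their cross-validated predictions (stacking).
learner     <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
                 "h2o.gbm.wrapper", "h2o.deeplearning.wrapper")
metalearner <- "h2o.glm.wrapper"

fit <- h2o.ensemble(x = x, y = y, training_frame = train,
                    family = "binomial",
                    learner = learner,
                    metalearner = metalearner,
                    cvControl = list(V = 5))  # 5-fold CV to build the level-one data

pred <- predict(fit, train)                   # in practice, predict on held-out data
```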
Amy presented H2O in the useR! sponsor talk and went over the architecture of our product. Her live demo showed the speed and scale of H2O through the R interface. On top of reading in data and aggregating columnar data at lightning-fast speed, H2O also comes with a suite of sophisticated models with all parameters exposed to the front end for ease of use. This attracted discussion at our booth even as the conference came to a close and we began packing up our banners. Many academics expressed interest in using H2O to teach students machine learning algorithms, while people in industry discussed partnerships and use cases. The emphasis of the talk was to encourage R users to try H2O and to build a community of users with interesting questions, ideas, and feedback who can ultimately help provide a better open source H2O experience for everyone.
Slides: H2O Overview with Amy Wang at useR! Aalborg
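For a taste of the R interface shown in the demo, here is a minimal sketch: import a file into the cluster, split it, and train a GBM. The S3 path, column name, and parameters are placeholders for illustration, not the exact data or settings used in the live demo.

```r
# Minimal sketch of scripting H2O from R; path and column names are placeholders.
library(h2o)

h2o.init(nthreads = -1)     # local cluster; use h2o.init(ip = ..., port = ...) for a remote one

df <- h2o.importFile("s3://my-bucket/large_file.csv")   # parsed in parallel across the cluster
splits <- h2o.splitFrame(df, ratios = 0.8, seed = 1)
train <- splits[[1]]
valid <- splits[[2]]

y <- "label"
x <- setdiff(names(df), y)

model <- h2o.gbm(x = x, y = y,
                 training_frame = train,
                 validation_frame = valid,
                 ntrees = 100, max_depth = 5, learn_rate = 0.1)

h2o.performance(model, valid = TRUE)          # validation metrics
```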
Matt also stopped by Copenhagen to give a talk at the R Summit. You can find his R Summit slides on our SlideShare.
Check out our GitHub page for instructions, scripts, and datasets.
Click here for R demos
Special thanks to the useR! organizing committee and all the people who stopped by our booth!