February 22nd, 2018

New features in H2O 3.18

Category: AutoML, Ensembles, H2O Release, XGBoost

Wolpert Release (H2O 3.18)

There’s a new major release of H2O and it’s packed with new features and fixes!
We named this release after David Wolpert, who is famous for inventing Stacking (aka Stacked Ensembles). Stacking is a central component in H2O AutoML, so we’re very grateful for his contributions to machine learning! He is also famous for the “No Free Lunch” theorem, which states, roughly, that no single algorithm will be the best in all cases. In other words, there’s no magic bullet. This is precisely why stacking is such a powerful and practical algorithm: you never know in advance whether a Deep Neural Network, a GBM, or a Random Forest will be the best algorithm for your problem. When you combine all of these into a stacked ensemble, the ensemble can draw on the strengths of each of its base algorithms. You can read more about Dr. Wolpert and his work here.

Distributed XGBoost

The central feature of this release is support for distributed XGBoost, along with a number of other XGBoost enhancements and bug fixes. We are bringing XGBoost support to more platforms (including older versions of CentOS and Ubuntu), and we now support multi-node XGBoost training (though this feature is still in “beta”).
There are a number of XGBoost bug fixes as well, such as the ability to use XGBoost models after they have been saved to disk and re-loaded into the H2O cluster, and fixes to the XGBoost MOJO. With all these improvements to H2O’s XGBoost, we are much closer to adding XGBoost to AutoML, and you can expect to see that in a future release. You can read more about the H2O XGBoost integration in the XGBoost User Guide.
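For readers who want to try it out, here is a rough sketch of training H2O’s XGBoost from the Python API. The dataset is one of H2O’s public test files, and the hyperparameter values are purely illustrative:

    # A rough sketch of training H2O XGBoost from Python; the dataset is a
    # public H2O test file and the hyperparameter values are illustrative.
    import h2o
    from h2o.estimators.xgboost import H2OXGBoostEstimator

    h2o.init()  # connect to (or start) an H2O cluster

    train = h2o.import_file(
        "https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")

    model = H2OXGBoostEstimator(ntrees=50, max_depth=5, seed=1)
    model.train(y="class", training_frame=train)
    print(model.model_performance(train))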

AutoML & Stacked Ensembles

One big addition to H2O Automatic Machine Learning (AutoML) is the ability to turn off certain algorithms. By default, H2O AutoML will train Gradient Boosting Machines (GBM), Random Forests (RF), Generalized Linear Models (GLM), Deep Neural Networks (DNN) and Stacked Ensembles. However, sometimes it may be useful to turn off some of those algorithms. In particular, if you have sparse, wide data, you may choose to turn off the tree-based models (GBMs and RFs). Conversely, if tree-based models perform comparatively well on your data, then you may choose to turn off GLMs and DNNs. Keep in mind that Stacked Ensembles benefit from diversity of the set of base learners, so keeping “bad” models may still improve the overall performance of the Stacked Ensembles created by the AutoML run. The new argument is called exclude_algos and you can read more about it in the AutoML User Guide.
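Here is a minimal sketch of the new exclude_algos argument from the Python API; the dataset, runtime budget, and excluded algorithms are illustrative:

    # A minimal sketch of the new exclude_algos argument; the algorithm
    # names and runtime budget here are illustrative.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file(
        "https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")

    # Skip the tree-based models, leaving GLM, Deep Learning and the ensembles
    aml = H2OAutoML(max_runtime_secs=60, exclude_algos=["GBM", "DRF"], seed=1)
    aml.train(y="class", training_frame=train)
    print(aml.leaderboard)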
There are several improvements to the Stacked Ensemble functionality in H2O 3.18. The big new feature is the ability to fully customize the metalearning algorithm. The default metalearner (a GLM with non-negative weights) usually does quite well; however, you are encouraged to experiment with other algorithms (such as GBM) and various hyperparameter settings. In the next major release, we will add the ability to easily perform a grid search over the hyperparameters of the metalearner algorithm using the standard H2O Grid Search functionality.
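As a rough illustration, specifying a custom metalearner from Python looks something like the sketch below. It assumes the metalearner_algorithm argument on H2OStackedEnsembleEstimator; all other settings are illustrative:

    # A rough sketch of stacking with a custom (GBM) metalearner. Assumes the
    # metalearner_algorithm argument on H2OStackedEnsembleEstimator; all other
    # settings are illustrative.
    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.estimators.random_forest import H2ORandomForestEstimator
    from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

    h2o.init()
    train = h2o.import_file(
        "https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
    y = "class"

    # Base learners need identical folds and must keep their cross-validated
    # predictions, which become the training data for the metalearner.
    common = dict(nfolds=5, fold_assignment="Modulo",
                  keep_cross_validation_predictions=True, seed=1)
    gbm = H2OGradientBoostingEstimator(**common)
    gbm.train(y=y, training_frame=train)
    drf = H2ORandomForestEstimator(**common)
    drf.train(y=y, training_frame=train)

    # Replace the default GLM metalearner with a GBM
    ensemble = H2OStackedEnsembleEstimator(base_models=[gbm, drf],
                                           metalearner_algorithm="gbm")
    ensemble.train(y=y, training_frame=train)

Note that the base models are trained with identical cross-validation folds and keep their cross-validated predictions, since that is what the metalearner is trained on.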

Highlights

Below is a list of some of the highlights from the 3.18 release. As usual, you can see the full list of items that went into this release in the Changes.md file in the h2o-3 GitHub repository.
New Features:

  • PUBDEV-4652 – Added support for XGBoost multi-node training in H2O
  • PUBDEV-4980 – Users can now exclude certain algorithms during an AutoML run
  • PUBDEV-5086 – Stacked Ensembles now allow the user to pass in a customized metalearner
  • PUBDEV-5224 – Users can now specify a seed parameter in Stacked Ensemble
  • PUBDEV-5204 – GLM: Allow the user to specify a list of interaction terms to include/exclude

Bugs:

  • PUBDEV-4585 – Fixed an issue that caused XGBoost binary save/load to fail
  • PUBDEV-4593 – Fixed an issue that caused a Levenshtein Distance Normalization Error
  • PUBDEV-5133 – In Flow, the scoring history plot is now available for GLM models
  • PUBDEV-5195 – Fixed an issue in XGBoost that caused MOJOs to fail to work without manually adding the Commons Logging dependency
  • PUBDEV-5215 – Users can now specify interactions when running GLM in Flow
  • PUBDEV-5315 – Fixed an issue that caused XGBoost OpenMP to fail on Ubuntu 14.04

Documentation:

  • PUBDEV-5311 – The H2O-3 download site now includes a link to the HTML version of the R documentation

Download here: http://h2o-release.s3.amazonaws.com/h2o/latest_stable.html
