
H2O Release 3.28 (Yu)


By Michal Kurka | December 20, 2019

Category: H2O Release

There’s a new major release of H2O, and it’s packed with new features and fixes! Among the big new features in this release, we’ve introduced support for Hierarchical GLM, added an option to parallelize Grid Search, upgraded XGBoost with newly added features, and improved our AutoML framework. The release is named after Bin Yu.

Hierarchical GLM

We are very excited to add HGLM (Hierarchical GLM) to our open source offering. HGLM fits generalized linear models with random effects, where the random effect can come from a conjugate exponential-family distribution (for example, Gaussian). HGLM allows you to specify both fixed and random effects, which enables fitting correlated random effects as well as random regression models. HGLM can be used for linear mixed models and for generalized linear mixed models with a variety of links and a variety of distributions for both the outcomes and the random effects.

You can review the detailed documentation here. This release implements HGLM for the Gaussian family, but stay tuned, or better yet, tell us which distributions you want to see next. Try it out and send us your feedback! We would also like to hear which model metrics you are interested in seeing.
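As a quick illustration, here is a minimal sketch of what fitting a Gaussian HGLM could look like from the Python client. The parameter names used below (HGLM, random_columns, rand_family) and the column names are assumptions for illustration; please consult the documentation linked above for the exact API.

```python
# Hedged sketch: fitting a Gaussian HGLM with one random effect.
# Parameter and column names below are assumptions; see the HGLM docs.
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()

# hypothetical frame: "group" is the grouping column for the random effect
train = h2o.import_file("my_data.csv")
train["group"] = train["group"].asfactor()

hglm = H2OGeneralizedLinearEstimator(
    family="gaussian",          # fixed-effect distribution
    HGLM=True,                  # switch GLM into hierarchical mode (assumed flag)
    random_columns=["group"],   # column(s) supplying the random effect
    rand_family=["gaussian"],   # distribution of the random effect
)
hglm.train(x=["x1", "x2"], y="y", training_frame=train)
print(hglm)
```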

Parallelized Grid Search

This release adds a new way to speed up grid search by training n models in parallel. In both Python and R, a new parameter, parallelism, determines the level of parallelism used during Grid Search. By default, models are built sequentially (parallelism = 1). There are two additional ways to set the level of parallelism:

  • Manual setting of the number of models built in parallel: parallelism = n where n > 1
  • H2O Heuristics: parallelism = 0

H2O always attempts to keep the number of models given by the parallelism argument training simultaneously: once a model finishes, it is added to the grid, and another one is started immediately.
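A short sketch of what a parallel grid search could look like in Python. The dataset and column names are hypothetical, and the placement of the parallelism argument (here on the H2OGridSearch constructor) should be verified against your client's documentation.

```python
# Hedged sketch: grid search with up to 4 models trained in parallel.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
train = h2o.import_file("my_train.csv")   # hypothetical dataset

hyper_params = {"max_depth": [3, 5, 7], "learn_rate": [0.05, 0.1]}
grid = H2OGridSearch(model=H2OGradientBoostingEstimator(ntrees=100),
                     hyper_params=hyper_params,
                     parallelism=4)        # 1 = sequential, n > 1 = manual, 0 = H2O heuristics
grid.train(x=["x1", "x2"], y="y", training_frame=train)
print(grid.get_grid(sort_by="rmse"))
```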

XGBoost upgrade and new features (checkpointing, Platt Scaling)

We’ve upgraded the XGBoost library to version 0.90, which brings a wide range of bug fixes and performance improvements. More details can be found in the XGBoost 0.90 Release Notes. With the XGBoost 1.0 release drawing near, we are preparing to integrate it with H2O as soon as possible.

Platt Scaling is now available for XGBoost. You can now use the calibrate_model and calibration_frame parameters when training an XGBoost model.

XGBoost now supports resuming from a trained model (checkpointing) as well.
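The brief sketch below combines the two new options: Platt Scaling via calibrate_model/calibration_frame, and resuming training via the standard checkpoint parameter. The dataset and column names are hypothetical.

```python
# Hedged sketch: XGBoost with Platt Scaling and checkpointing.
import h2o
from h2o.estimators import H2OXGBoostEstimator

h2o.init()
train = h2o.import_file("train.csv")
calib = h2o.import_file("calibration.csv")   # held-out frame used for Platt Scaling

xgb = H2OXGBoostEstimator(ntrees=50,
                          calibrate_model=True,       # enable Platt Scaling
                          calibration_frame=calib)
xgb.train(x=["x1", "x2"], y="label", training_frame=train)

# resume training from the previous model (checkpointing)
xgb_more = H2OXGBoostEstimator(ntrees=100, checkpoint=xgb.model_id,
                               calibrate_model=True, calibration_frame=calib)
xgb_more.train(x=["x1", "x2"], y="label", training_frame=train)
```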

Integrations

H2O can now be used on Cloudera Data Platform, an enterprise-ready cloud data environment. In addition, we added support for new versions of Hadoop distributions, improved support for Kerberos authentication (SPNEGO), and improved H2O cluster formation (cloud forming) in environments with restricted network access.

Logging in h2o-genmodel.jar

We have extended the logging capabilities of hex.genmodel.easy.EasyPredictModelWrapper when contributions are enabled. H2O now supports the SLF4J logging library, but no SLF4J library is bundled with the H2O module; you must therefore ensure the library is present on the classpath in order to use this new functionality. When there is no SLF4J library on the classpath, the original logging behavior is preserved.

AutoML Improvements

This new release also comes with a set of new features for H2O AutoML.

Monotonic constraints can now be enforced in AutoML; in some cases, constraints improve the predictive performance of the models. Among the set of models trained in an AutoML run, only XGBoost and H2O GBM models are able to enforce monotonicity; however, this can also benefit the Stacked Ensembles built on top of them.
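A hedged sketch of how this could look from Python. The monotone_constraints argument name and the feature name are assumptions for illustration; check the AutoML documentation for the exact parameter.

```python
# Hedged sketch: ask AutoML to keep the response non-decreasing in a feature.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")                 # hypothetical dataset

aml = H2OAutoML(max_models=20, seed=1,
                monotone_constraints={"age": 1})     # +1 increasing, -1 decreasing (assumed name)
aml.train(y="label", training_frame=train)
```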

We are now providing more details about the models produced by AutoML in an extended version of the leaderboard (thanks to the new get_leaderboard function in the Python and R clients). For now, this includes information like the training time of the model and the average prediction time (per row), but we plan to add more useful model information in future releases. Prediction speed is especially useful to measure when considering which models to deploy to production. The leaderboard now also includes the Area Under the Precision-Recall Curve (AUCPR) as an additional metric for binary classification problems (also available as a new option for stopping_metric and sort_metric in AutoML).
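For example, a minimal sketch of requesting the extended leaderboard from Python (the extra_columns value and dataset are assumptions; see the AutoML docs for the accepted options):

```python
# Hedged sketch: extended AutoML leaderboard with training/prediction times.
import h2o
from h2o.automl import H2OAutoML, get_leaderboard

h2o.init()
train = h2o.import_file("train.csv")                 # hypothetical dataset

aml = H2OAutoML(max_models=10, sort_metric="AUCPR", seed=1)  # AUCPR is a new option
aml.train(y="label", training_frame=train)

lb = get_leaderboard(aml, extra_columns="ALL")       # add the new informational columns
print(lb.head(rows=10))
```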

To improve AutoML reproducibility across versions, or simply to give you a bit more control over the training pipeline, we now expose a new modeling_plan parameter listing the steps AutoML takes into consideration, and, in return, a modeling_steps property on the AutoML object listing the steps that were actually used during training. For reproducibility, this list can be fed back into a new AutoML instance by passing its value to the modeling_plan parameter. This also opens the door to various customizations, like the possibility to plug in your own steps written in Java, but we’ll tell you more about this very soon.
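A small sketch of this round trip, based on the parameter and property named above (dataset and columns are hypothetical):

```python
# Hedged sketch: capture the steps an AutoML run executed and replay them.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")                 # hypothetical dataset

aml1 = H2OAutoML(max_models=10, seed=1)
aml1.train(y="label", training_frame=train)

steps = aml1.modeling_steps                          # steps actually used during training
print(steps)

# replay the same plan in a new run (e.g. on a newer H2O version)
aml2 = H2OAutoML(max_models=10, seed=1, modeling_plan=steps)
aml2.train(y="label", training_frame=train)
```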

Finally, you’ll find the usual bundle of bug fixes and minor improvements, among which are an improved AutoML widget in the Flow UI, more consistent handling of AutoML reruns (when retraining AutoML with the same project name), and a noticeable change in the default value of the max_runtime_secs parameter, whose previous default proved inconvenient in many cases.

Scikit-learn integration

Scikit-learn users will be happy to learn that this release comes with a new integration API that removes most limitations of the legacy support.

All sklearn-compatible estimators and transformers are exposed in a new h2o.sklearn module. They accept the same parameters as their counterparts from the h2o.estimators and h2o.automl modules, but in addition to H2OFrame, they also accept numpy arrays or pandas dataframes and can therefore be combined with the usual Scikit-learn transformers in a Pipeline, for example. Also, all the standard sklearn ways of setting or modifying the parameters of those estimators are now supported.
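A short sketch of the idea. The exact generated class name (H2OGradientBoostingClassifier below) is an assumption; inspect the h2o.sklearn module for the names available in your installation.

```python
# Hedged sketch: an H2O estimator inside a regular scikit-learn Pipeline.
import numpy as np
import h2o
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from h2o.sklearn import H2OGradientBoostingClassifier   # assumed generated class name

h2o.init()                                 # start or connect to an H2O cluster

X = np.random.rand(100, 4)                 # plain numpy arrays are accepted
y = np.random.randint(0, 2, 100)

pipe = Pipeline([
    ("scale", StandardScaler()),                        # regular sklearn transformer
    ("gbm", H2OGradientBoostingClassifier(ntrees=20)),  # H2O estimator, sklearn API
])
pipe.fit(X, y)
print(pipe.predict(X[:5]))
```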

Documentation Updates

This release adds easy-to-follow code examples to most of the functions in the Python Module documentation. This documentation is available at http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html. 

What is coming next?

Constrained K-means

The current K-means implementation uses Lloyd's algorithm to determine clusters based on the distances of the data points from the centroids. During this release cycle, we prepared an improvement to this algorithm that adds the ability to set a minimum number of data points in each cluster.

To satisfy a custom minimal cluster size, the calculation of clusters is converted into a Minimum-Cost Flow problem. A graph is constructed based on the distances and the constraints, and the goal is to go iteratively through the data points, represented as input edges of the graph, and build an optimal spanning tree that satisfies the constraints. The Minimum-Cost Flow problem can be solved efficiently in polynomial time.
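To give a feel for the idea, here is a small illustrative sketch (not H2O's implementation) of one constrained assignment step phrased as a min-cost flow, using networkx: each point supplies one unit of flow, each cluster must absorb at least min_size units, and the surplus drains to a sink node.

```python
# Illustrative sketch only: constrained cluster assignment via min-cost flow.
import numpy as np
import networkx as nx

def constrained_assignment(X, centroids, min_size):
    n, k = len(X), len(centroids)
    # squared distances, scaled to integers (network simplex prefers integer weights)
    cost = np.rint(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1) * 1000).astype(int)

    G = nx.DiGraph()
    G.add_node("sink", demand=n - k * min_size)       # absorbs the unconstrained surplus
    for j in range(k):
        G.add_node(f"c{j}", demand=min_size)          # each cluster keeps >= min_size points
        G.add_edge(f"c{j}", "sink", weight=0)
    for i in range(n):
        G.add_node(f"p{i}", demand=-1)                # each point supplies one unit of flow
        for j in range(k):
            G.add_edge(f"p{i}", f"c{j}", capacity=1, weight=int(cost[i, j]))

    flow = nx.min_cost_flow(G)
    return [next(j for j in range(k) if flow[f"p{i}"].get(f"c{j}", 0) > 0)
            for i in range(n)]

# toy usage: 20 points, 3 clusters, at least 5 points per cluster
X = np.random.rand(20, 2)
print(constrained_assignment(X, X[:3], min_size=5))
```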

The performance of our implementation of the constrained K-means algorithm still needs to be improved, so we have not yet exposed the feature to users. It will be released in one of the minor releases of the 3.28 cycle.

Credits

This new H2O release is brought to you by Wendy Wong, Pavel Pscheidl, Jan Sterba, Michal Kurka, Zuzana Olajcova, Erin LeDell, Sebastien Poirier, Angela Bartz, and Veronika Maurerova.


Michal Kurka

Michal is a software engineer with a passion for crafting code in Java and other JVM languages. He started his professional career as a J2EE developer and spent his time building all sorts of web and desktop applications. Four years ago, he truly found himself when he entered the world of big data processing and Hadoop. Since then, he has enjoyed working with distributed platforms and implementing scalable applications on top of them. He holds a Master of Computer Science from Charles University in Prague. His field of study was Discrete Models and Algorithms with a focus on Optimization.