There’s a new major release of H2O, and it’s packed with new features and fixes! Among the big new features in this release, we introduce Isolation Forest to our portfolio of machine learning algorithms and integrate the XGBoost algorithm into our AutoML framework. The release is named after Zhihong Xia.
Isolation Forest is an unsupervised machine learning algorithm used for anomaly detection. Anomaly detection is applicable to a variety of use cases, including fraud detection and intrusion detection. The Isolation Forest algorithm differs from other methods typically used for anomaly detection: it directly identifies exceptional observations instead of learning the pattern of normal observations (as is done in the H2O deep-learning-based autoencoder). The H2O implementation of Isolation Forest is based on the Distributed Random Forest algorithm, so it is capable of analyzing large datasets on multi-node clusters. Note that Isolation Forest is currently in a Beta state; additional enhancements and improvements will be made in future releases. A blog post is available for more information.
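To give a sense of the API, here is a minimal sketch of training an Isolation Forest with the Python client. The dataset path and parameter values are illustrative placeholders, not recommendations:

```python
import h2o
from h2o.estimators.isolation_forest import H2OIsolationForestEstimator

h2o.init()

# Anomaly detection is unsupervised, so no response column is needed.
df = h2o.import_file("path/to/your_data.csv")  # placeholder path

# sample_size controls how many rows each tree is built on.
iso = H2OIsolationForestEstimator(ntrees=100, sample_size=256, seed=1234)
iso.train(training_frame=df)

# Predictions include a normalized anomaly score and the mean path length;
# shorter average paths mean an observation was easier to isolate (more anomalous).
preds = iso.predict(df)
preds.head()
```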
During the development of H2O-3 version 3.21.x, an API for tree inspection was introduced for both the Python and R clients. With the Tree API, it is possible to download, traverse, and inspect individual trees inside tree-based algorithms. In this release, the API can be used to fetch any tree from any tree-based model (Gradient Boosting Machine, Distributed Random Forest, XGBoost, and Isolation Forest). For more details, please see our latest documentation for Python and for R. There is also a blog post available.
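As an illustration, here is a minimal sketch of fetching a single tree from a GBM model with the Python Tree API; the dataset and column choices are assumptions for the example:

```python
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.tree import H2OTree

h2o.init()
df = h2o.import_file("path/to/your_data.csv")  # placeholder path

# Train a small GBM; the last column is assumed to be the response.
gbm = H2OGradientBoostingEstimator(ntrees=50, seed=1234)
gbm.train(x=df.columns[:-1], y=df.columns[-1], training_frame=df)

# Fetch the first tree; tree_class selects the class for multinomial
# models and can be omitted for regression or binomial models.
tree = H2OTree(model=gbm, tree_number=0)

print(tree.root_node)   # entry point for traversing the tree
print(tree.features)    # split feature recorded at each node
print(tree.thresholds)  # numeric split threshold at each node
```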
Our AutoML framework now includes the XGBoost algorithm, one of the most popular and powerful machine learning algorithms. H2O users have been able to leverage the power of XGBoost for quite some time; in the 3.22 release, however, we focused on further performance and stability improvements of our XGBoost implementation. Thanks to these improvements, we were able to include XGBoost in the fully automated setting of AutoML. XGBoost models built during the AutoML process will also be included in the final Stacked Ensemble models. Because XGBoost models are typically among the top performers on the AutoML Leaderboard, and because Stacked Ensemble models benefit from the added diversity of models, users can expect the final performance of H2O AutoML to improve on many datasets.
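Running AutoML itself is unchanged. Here is a minimal sketch with the Python client; the dataset path, response column, and run limits are placeholders:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("path/to/train.csv")  # placeholder path

x = train.columns
y = "response"  # placeholder response column name
x.remove(y)

# XGBoost models are now trained automatically alongside the other
# algorithms and are folded into the final Stacked Ensembles.
aml = H2OAutoML(max_models=20, max_runtime_secs=600, seed=1)
aml.train(x=x, y=y, training_frame=train)

# The leaderboard ranks all trained models, including the XGBoost ones.
print(aml.leaderboard)
```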
Feature engineering in H2O has been enhanced with the ability to encode categorical variables using the mean of a target variable. Target encoding is performed in two easy steps. The first step is to create a target-encoding map; because mean encoding is prone to overfitting, several strategies to counter it are included. The second step is simply to apply the target-encoding map created in the first step, which adds new columns with target-encoded values to the data. Previously, target encoding had only been available in R, but in 3.22 it is available in Java and Python as well. For details, please see the documentation.
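A sketch of the two-step flow in the Python client, assuming the TargetEncoder class from h2o.targetencoder as documented for this release; the column names and parameter values are placeholders:

```python
import h2o
from h2o.targetencoder import TargetEncoder

h2o.init()
train = h2o.import_file("path/to/train.csv")  # placeholder path

# A fold column is needed for the k-fold holdout scheme below.
train["fold"] = train.kfold_column(n_folds=5, seed=1234)

# Step 1: build the target-encoding map. Blending (a smoothed average of the
# per-level mean and the global mean) is one of the built-in overfitting guards.
encoder = TargetEncoder(x=["categorical_col"], y="response",  # placeholder columns
                        fold_column="fold", blended_avg=True,
                        inflection_point=3, smoothing=1)
encoder.fit(frame=train)

# Step 2: apply the map. The k-fold holdout is another guard against
# leaking the target into the encoded columns.
encoded = encoder.transform(frame=train, holdout_type="kfold", seed=1234)
```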
Below is a list of some of the highlights from the 3.22 release. As usual, you can see a list of all the items that went into this release in the Changes.md file in the h2o-3 GitHub repository.
Download our latest release: http://h2o-release.s3.amazonaws.com/h2o/latest_stable.html