June 20th, 2017

XGBoost in the H2O Machine Learning Platform


The new H2O release 3.10.5.1 brings a shiny new feature: integration of the powerful XGBoost library into the H2O Machine Learning Platform!
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.
XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.
By integrating XGBoost into the H2O Machine Learning Platform, we not only enrich the family of provided algorithms with one of the most powerful machine learning algorithms available, but we also expose it with all the nice features of H2O: the Python and R APIs, the Flow UI, real-time training progress, and MOJO support.

Example

Let’s quickly try to run XGBoost on the HIGGS dataset from Python. The first step is to get the latest H2O and install the Python library. Please follow the instructions on the H2O download page.
The next step is to download the HIGGS training and validation data. We can use sample datasets stored in S3:

wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_train_imbalance_100k.csv
wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_test_imbalance_100k.csv
# Or use full data: wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_head_2M.csv
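
Note that h2o.import_file (used below) can also read these files directly from their S3 URLs, making the local download optional.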

Now, it is time to start your favorite Python environment and build some XGBoost models.
The first step is to start H2O as a single-node cluster:

import h2o
h2o.init()

In the next step, we import and prepare the data via the H2O API:

train_path = 'higgs_train_imbalance_100k.csv'
valid_path = 'higgs_test_imbalance_100k.csv'
df_train = h2o.import_file(train_path)
df_valid = h2o.import_file(valid_path)
# Transform the first (response) column into a categorical column
df_train[0] = df_train[0].asfactor()
df_valid[0] = df_valid[0].asfactor()
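
For a quick sanity check of the imported data, we can look at the frame dimensions and the response distribution (a minimal check using the standard H2O frame API):

print(df_train.dim)    # number of rows and columns
df_train[0].table()    # class counts of the binary response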

After data preparation, it is time to build an XGBoost model. Let’s try to train 100 trees with a maximum depth of 10:

param = {
      "ntrees" : 100                      # number of trees to build
    , "max_depth" : 10                    # maximum tree depth
    , "learn_rate" : 0.02                 # shrinkage applied to each tree
    , "sample_rate" : 0.7                 # row sampling rate per tree
    , "col_sample_rate_per_tree" : 0.9    # column sampling rate per tree
    , "min_rows" : 5                      # minimum number of observations in a leaf
    , "seed": 4241                        # seed for reproducibility
    , "score_tree_interval": 100          # score the model after every 100 trees
}
from h2o.estimators import H2OXGBoostEstimator
model = H2OXGBoostEstimator(**param)
# The first column (index 0) is the response; all remaining columns are features
model.train(x = list(range(1, df_train.shape[1])), y = 0, training_frame = df_train, validation_frame = df_valid)

At this point, we can use the trained model like any other H2O model, for example, to generate predictions:

# predict() returns three columns (predict, p0, p1); take p1, the predicted probability of class 1
prediction = model.predict(df_valid)[:, 2]

Or we can open the H2O Flow UI and explore the model's properties in a nice, user-friendly way, or rebuild the model with different training parameters.
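
If you prefer to stay in Python, the same information is available from the model object (assuming the model and df_valid from above):

model.summary()                           # overview of the trained trees
perf = model.model_performance(df_valid)  # score the model on the validation frame
print(perf.auc())                         # validation AUC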

Technical Details

The integration of XGBoost into the H2O Machine Learning Platform utilizes the JNI interface of XGBoost and the corresponding native libraries. H2O wraps all JNI calls and exposes them as regular H2O model and model builder APIs.
The implementation itself is based on two separate modules that enrich the core H2O platform.
The first module, h2o-genmodel-ext-xgboost, extends the h2o-genmodel module and registers an XGBoost-specific MOJO. The module also contains all necessary XGBoost binary libraries. Right now, the module provides libraries for OS X and Linux; however, Windows support is coming soon.
The module can contain multiple libraries for each platform to support different configurations (e.g., with/without GPU/OMP). H2O always tries to load the most powerful one (currently a library with GPU and OMP support). If that fails, the loader tries the next one in the loader chain. For each platform, we always provide an XGBoost library with a minimal configuration (supporting only a single CPU) that serves as a fallback in case all the other libraries fail to load.
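
To illustrate the idea of the loader chain, here is a minimal Python sketch of the fallback strategy (H2O's actual loader is implemented in Java, and the library file names below are purely hypothetical):

import ctypes

def load_first_available(paths):
    # Try the most capable build first and fall back to simpler ones
    for path in paths:
        try:
            return ctypes.CDLL(path)
        except OSError:
            continue
    raise RuntimeError("no XGBoost native library could be loaded")

# Hypothetical file names, ordered from most to least powerful;
# the minimal single-CPU build serves as the final fallback
lib = load_first_available([
    "libxgboost4j_gpu.so",
    "libxgboost4j_omp.so",
    "libxgboost4j_minimal.so",
])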
The second module, h2o-ext-xgboost, contains the actual XGBoost model and model builder code, which communicates with the native XGBoost libraries via the JNI API. The module also provides all necessary REST API definitions to expose the XGBoost model builder to clients.

Note: To learn more about H2O's modular architecture, please review our H2O Platform Extensibility blog post.

Limitations

There are several technical limitations of the current implementation that we are working to resolve; nevertheless, it is necessary to mention them. In general, if XGBoost cannot be initialized for any reason (e.g., an unsupported platform), the algorithm is not exposed via the REST API and is not available to clients. Clients can verify the availability of XGBoost via the corresponding client API call. For example, in Python:

is_xgboost_available = H2OXGBoostEstimator.available()
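
Clients can then guard model training on this check, for example (a minimal sketch; the estimator parameters are illustrative only):

from h2o.estimators import H2OXGBoostEstimator

if H2OXGBoostEstimator.available():
    # Safe to build XGBoost models on this cluster
    model = H2OXGBoostEstimator(ntrees=50, max_depth=5)
    model.train(x=list(range(1, df_train.shape[1])), y=0, training_frame=df_train)
else:
    print("XGBoost is not available on this H2O cluster")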

The list of limitations includes:

  1. Right now, XGBoost is initialized only for single-node H2O clusters; however, multi-node XGBoost support is coming soon.
  2. The list of supported platforms includes:

     Platform | Minimal XGBoost | OMP | GPU | Compilation OS
     ---------|-----------------|-----|-----|----------------------
     Linux    | yes             | yes | yes | Ubuntu 14.04, g++ 4.7
     OS X     | yes             | no  | no  | OS X 10.11
     Windows  | no              | no  | no  | NA

     Note: The minimal XGBoost configuration includes support for a single CPU only.

  3. Furthermore, because we are using native XGBoost libraries that depend on OS/platform libraries, it is possible that on older operating systems XGBoost will not be able to find all necessary binary dependencies, and thus will not be initialized or made available.
  4. XGBoost GPU libraries are compiled against CUDA 8, which is a necessary runtime requirement in order to utilize XGBoost GPU support.

Please give H2O XGBoost a chance, try it, and let us know about your experience or suggest improvements via h2ostream!
