February 8th, 2017

Stacked Ensembles and Word2Vec now available in H2O!

Category: Data Munging, Ensembles, H2O Release, NLP, Python, R, Technical

Prepared by: Erin LeDell and Navdeep Gill

Stacked Ensembles

H2O’s new Stacked Ensemble method is a supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking, or “Super Learning.” The method currently supports regression and binary classification; multiclass support is planned for a future release. A full list of the planned features for Stacked Ensemble can be viewed here.
H2O previously supported the creation of ensembles of H2O models through a separate implementation, the h2oEnsemble R package, which is still available and will continue to be maintained. For new projects, however, we recommend the native H2O version. Native support for stacking in the H2O backend brings ensembles to all the H2O APIs.
Creating ensembles of H2O models is now dead simple: pass a list of existing H2O model ids to the stacked ensemble function and you are ready to go. This list can be a set of manually created H2O models, a random grid of models (of GBMs, for example), or a set of grids of different algorithms. Typically, the more diverse the collection of base models, the better the ensemble performs. H2O’s Random Grid Search is therefore a handy way to quickly generate a set of base models for the ensemble.
R:

ensemble <- h2o.stackedEnsemble(x = x, y = y, training_frame = train, base_models = my_models)

Python:

from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

ensemble = H2OStackedEnsembleEstimator(base_models=my_models)
ensemble.train(x=x, y=y, training_frame=train)

Full R and Python code examples are available on the Stacked Ensembles docs page. Kagglers rejoice!
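
For readers new to stacking, the general idea can be sketched in a few lines of plain Python. This is only an illustration of the technique, not H2O's implementation: the two toy base learners, the modulo fold assignment, the least-squares metalearner, the synthetic data, and every function name below are assumptions made for the example.

```python
import random

def fit_mean(xs, ys):
    """Toy base learner 1: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Toy base learner 2: one-dimensional least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def cv_predictions(fit, xs, ys, nfolds=5):
    """Out-of-fold predictions: row i is scored by a model that never saw row i."""
    preds = [0.0] * len(xs)
    for fold in range(nfolds):
        train = [i for i in range(len(xs)) if i % nfolds != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), nfolds):
            preds[i] = model(xs[i])
    return preds

def fit_metalearner(z1, z2, ys):
    """Least-squares weights (a, b) for y ~ a*z1 + b*z2 (2x2 normal equations)."""
    s11 = sum(p * p for p in z1)
    s22 = sum(q * q for q in z2)
    s12 = sum(p * q for p, q in zip(z1, z2))
    t1 = sum(p * y for p, y in zip(z1, ys))
    t2 = sum(q * y for q, y in zip(z2, ys))
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - t2 * s12) / det, (s11 * t2 - s12 * t1) / det

def fit_stack(xs, ys, nfolds=5):
    # Level one: cross-validated predictions from each base learner.
    z1 = cv_predictions(fit_mean, xs, ys, nfolds)
    z2 = cv_predictions(fit_line, xs, ys, nfolds)
    # Metalearner: learn how to combine the base learners.
    a, b = fit_metalearner(z1, z2, ys)
    # Refit the base learners on all the data for use at prediction time.
    m1, m2 = fit_mean(xs, ys), fit_line(xs, ys)
    return lambda x: a * m1(x) + b * m2(x)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (an assumption for the demo): a noisy straight line.
random.seed(0)
xs = [i / 10 for i in range(40)]
ys = [2 * x + 1 + random.gauss(0, 0.1) for x in xs]

ensemble = fit_stack(xs, ys)
print("mean-only MSE:", mse(fit_mean(xs, ys), xs, ys))
print("ensemble MSE: ", mse(ensemble, xs, ys))
```

The key design point is that the metalearner is trained on out-of-fold predictions rather than on in-sample predictions, so it learns how much to trust each base learner on data that learner has not seen.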

Word2Vec

H2O now has a full implementation of Word2Vec. Word2Vec is a group of related models used to produce word embeddings, a language modeling and feature engineering technique in natural language processing in which words or phrases are mapped to vectors of real numbers. The word embeddings can subsequently be used in a machine learning model such as GBM. This allows users to use text-based data with current H2O algorithms in a very efficient manner. An R example is available here.

Technical Details

H2O’s Word2Vec is based on the skip-gram model. The training objective of skip-gram is to learn word vector representations that are good at predicting a word's context in the same sentence. Mathematically, given a sequence of training words $w_1, w_2, \dots, w_T$, the objective of the skip-gram model is to maximize the average log-likelihood
$$\frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{k} \log p(w_{t+j} | w_t)$$
where $k$ is the size of the training window.
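
To make the double sum concrete, here is a small Python sketch that enumerates the (center, context) pairs the objective ranges over and averages a log-probability over them. It assumes, as in the original Word2Vec formulation, that the $j = 0$ term is skipped and that positions falling outside the sentence are ignored; the function names and the uniform placeholder model are hypothetical and exist only for this example.

```python
import math

def skipgram_pairs(words, k):
    """Return (center, context) pairs for each position t and offset j in [-k, k], j != 0."""
    pairs = []
    for t, center in enumerate(words):
        for j in range(-k, k + 1):
            # Skip the center word itself and offsets that leave the sentence.
            if j != 0 and 0 <= t + j < len(words):
                pairs.append((center, words[t + j]))
    return pairs

def average_log_likelihood(words, k, log_p):
    """(1/T) * sum over all pairs of log p(context | center)."""
    total = sum(log_p(context, center)
                for center, context in skipgram_pairs(words, k))
    return total / len(words)

sentence = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(sentence, 1))

# Placeholder model (an assumption): every vocabulary word is equally likely.
def uniform_log_p(context, center):
    return math.log(1 / len(set(sentence)))

print(average_log_likelihood(sentence, 1, uniform_log_p))
```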
In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$, which are vector representations of $w$ as a word and as a context, respectively. The probability of correctly predicting word $w_i$ given word $w_j$ is determined by the softmax model, which is
$$p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})}$$
where $V$ is the vocabulary size.
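
The softmax above can be written out directly. The sketch below is a plain-Python illustration under stated assumptions: the tiny random vectors, the variable names `word_vecs` (the $u$ vectors) and `context_vecs` (the $v$ vectors), and the max-subtraction trick for numerical stability are choices made for this example and are not taken from H2O's code.

```python
import math
import random

def softmax_probs(j, word_vecs, context_vecs):
    """Return [p(w_i | w_j) for every word i] under the full softmax."""
    v_j = context_vecs[j]
    # One dot product per vocabulary word: this is the O(V) cost.
    scores = [sum(a * b for a, b in zip(u, v_j)) for u in word_vecs]
    top = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
vocab_size, dim = 5, 3
word_vecs = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
context_vecs = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

probs = softmax_probs(0, word_vecs, context_vecs)
print(probs)
print(sum(probs))  # a probability distribution over the vocabulary
```

Note that every call touches all `vocab_size` word vectors to compute the normalizing sum, which is the cost the next paragraph refers to.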
The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$ is proportional to $V$, which can easily be on the order of millions. To speed up training of Word2Vec, we used hierarchical softmax, which reduces the complexity of computing $\log p(w_i | w_j)$ to $O(\log(V))$.
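
The following sketch shows one common way hierarchical softmax achieves this. The source states only that hierarchical softmax is used, so the details here are assumptions: a Huffman tree built from word counts (as in the original word2vec tool), one vector per internal tree node, and hypothetical function names. Each word's probability is a product of sigmoids along its root-to-leaf path, so only about $\log_2 V$ dot products are needed for a balanced tree.

```python
import heapq
import itertools
import math
import random

def build_huffman_paths(counts):
    """Map each word to its root-to-leaf path: [(internal_node, +1 or -1), ...]."""
    tiebreak = itertools.count()  # keeps heap entries comparable on equal counts
    heap = [(c, next(tiebreak), w) for w, c in counts.items()]
    heapq.heapify(heap)
    children = {}
    next_id = 0
    while len(heap) > 1:
        c1, _, n1 = heapq.heappop(heap)
        c2, _, n2 = heapq.heappop(heap)
        node = ("node", next_id)
        next_id += 1
        children[node] = (n1, n2)
        heapq.heappush(heap, (c1 + c2, next(tiebreak), node))
    root = heap[0][2]
    paths = {}
    stack = [(root, [])]
    while stack:
        node, path = stack.pop()
        if node in children:
            left, right = children[node]
            stack.append((left, path + [(node, +1)]))
            stack.append((right, path + [(node, -1)]))
        else:
            paths[node] = path  # reached a leaf, i.e. a word
    return paths

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_prob(word, v, paths, node_vecs):
    """p(word | conditioning vector v) as a product of branch probabilities."""
    p = 1.0
    for node, direction in paths[word]:
        score = sum(a * b for a, b in zip(node_vecs[node], v))
        p *= sigmoid(direction * score)  # sigmoid(s) + sigmoid(-s) == 1
    return p

random.seed(0)
counts = {w: 1 for w in "abcdefgh"}  # 8 equally frequent toy words
paths = build_huffman_paths(counts)
dim = 4
internal = {node for path in paths.values() for node, _ in path}
node_vecs = {n: [random.uniform(-1, 1) for _ in range(dim)] for n in internal}
v = [random.uniform(-1, 1) for _ in range(dim)]

print({w: len(p) for w, p in paths.items()})            # path length per word
print(sum(hs_prob(w, v, paths, node_vecs) for w in counts))  # sums to 1 up to rounding
```

Because the two branch probabilities at every internal node sum to one, the leaf probabilities form a valid distribution without ever computing a normalizing sum over the whole vocabulary.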

Tverberg Release (H2O 3.10.3.4)

Below is a detailed list of all the items that are part of the Tverberg release.
List of New Features:
PUBDEV-2058– Implement word2vec in h2o (To use this feature in R, please visit this demo)
PUBDEV-3635– Ability to Select Columns for PDP computation in Flow (With this enhancement, users can select in Flow which features/columns to render Partial Dependence Plots for; R and Python already support this. Known issue PUBDEV-3782: when nbins < categorical levels, PDP won’t compute. Please also see this post.)
PUBDEV-3881– Add PCA Estimator documentation to Python API Docs
PUBDEV-3902– Documentation: Add information about Azure support to H2O User Guide (Beta)
PUBDEV-3739– StackedEnsemble: put ensemble creation into the back end.

List of Improvements:
PUBDEV-3989– Decrease size of h2o.jar
PUBDEV-3257– Documentation: As a K-Means user, I want to be able to better understand the parameters
PUBDEV-3741– StackedEnsemble: add tests in R and Python to ensure that a StackedEnsemble performs at least as well as the base_models
PUBDEV-3857– Clean up the generated Python docs
PUBDEV-3895– Filter H2OFrame on pandas dates and time (python)
PUBDEV-3912– Provide way to specify context_path via Python/R h2o.init methods
PUBDEV-3933– Modify gen_R.py for Stacked Ensemble
PUBDEV-3972– Add Stacked Ensemble code examples to Python docstrings

List of Bugs:
PUBDEV-2464– Using asfactor() in Python client cannot allocate to a variable
PUBDEV-3111– R API’s h2o.interaction() does not use destination_frame argument
PUBDEV-3694– Errors with PCA on wide data for pca_method = GramSVD which is the default
PUBDEV-3742– StackedEnsemble should work for regression
PUBDEV-3865– h2o gbm : for an unseen categorical level, discrepancy in predictions when score using h2o vs pojo/mojo
PUBDEV-3883– Negative indexing for H2OFrame is buggy in R API
PUBDEV-3894– Relational operators don’t work properly with time columns.
PUBDEV-3966– java.lang.AssertionError when using h2o.makeGLMModel
PUBDEV-3835– Standard Errors in GLM: calculating and showing specifically when called
PUBDEV-3965– Importing data in python returns error – TypeError: expected string or bytes-like object
Hotfix: Remove StackedEnsemble from Flow UI. Training is only supported from Python and R interfaces. Viewing is supported in the Flow UI.

List of Tasks:
PUBDEV-3336– h2o.create_frame(): if randomize=True, value param cannot be used
PUBDEV-3740– REST: implement simple ensemble generation API
PUBDEV-3843– Modify R REST API to always return binary data
PUBDEV-3844– Safe GET calls for POJO/MOJO/genmodel
PUBDEV-3864– Import files by pattern
PUBDEV-3884– StackedEnsemble: Add to online documentation
PUBDEV-3940– Add Stacked Ensemble code examples to R docs
Download here: http://h2o-release.s3.amazonaws.com/h2o/rel-tverberg/4/index.html
