Stacked Ensembles and Word2Vec now available in H2O!

Published: February 08, 2017

Written by: H2O.ai Team

Prepared by: Erin LeDell and Navdeep Gill

Stacked Ensembles

R:

ensemble <- h2o.stackedEnsemble(x = x, y = y, training_frame = train, base_models = my_models)

Python:

ensemble = H2OStackedEnsembleEstimator(base_models=my_models)
ensemble.train(x=x, y=y, training_frame=train)
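
For context, here is one way `my_models` might be put together in Python. This is a rough sketch rather than the official example: the helper name `build_stacked_ensemble` is made up for illustration, the choice of a GBM plus a Random Forest is arbitrary, and it assumes (as we understand the requirements) that every base model is cross-validated with the same folds and keeps its cross-validation predictions. It also assumes `train` is an H2OFrame on a running H2O cluster.

```python
def build_stacked_ensemble(train, x, y, nfolds=5):
    """Train two base learners and stack them (hypothetical helper, not part of H2O)."""
    # Imported lazily so the sketch can be loaded without a running H2O cluster.
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.estimators.random_forest import H2ORandomForestEstimator
    from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

    # Assumption: base models share identical folds and keep their
    # cross-validated predictions so the metalearner can be trained on them.
    shared = dict(nfolds=nfolds, fold_assignment="Modulo",
                  keep_cross_validation_predictions=True, seed=1)

    gbm = H2OGradientBoostingEstimator(**shared)
    gbm.train(x=x, y=y, training_frame=train)

    rf = H2ORandomForestEstimator(**shared)
    rf.train(x=x, y=y, training_frame=train)

    # The ensemble is given the ids of the already-trained base models.
    my_models = [gbm.model_id, rf.model_id]
    ensemble = H2OStackedEnsembleEstimator(base_models=my_models)
    ensemble.train(x=x, y=y, training_frame=train)
    return ensemble
```

After `h2o.init()` and loading a frame, a call such as `build_stacked_ensemble(train, x, y)` would return the trained ensemble, which can then be scored like any other H2O model.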

Full R and Python code examples are available on the Stacked Ensembles docs page. Kagglers rejoice!

Word2Vec

Technical Details

H2O’s Word2Vec is based on the skip-gram model. The training objective of skip-gram is to learn word vector representations that are good at predicting a word’s context within the same sentence. Mathematically, given a sequence of training words $w_1, w_2, \dots, w_T$, the objective of the skip-gram model is to maximize the average log-likelihood
$$\frac{1}{T} \sum_{t = 1}^{T}\sum_{-k \le j \le k,\; j \ne 0} \log p(w_{t+j} | w_t)$$
where $k$ is the size of the training window.
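
To make the double sum concrete, here is a small pure-Python sketch that enumerates the (center, context) pairs the objective ranges over and averages a log-probability across them. The function names and the toy sentence are ours, not part of H2O. The sketch skips the offset $j = 0$ and any position that falls outside the sentence, which is the usual skip-gram convention, and it plugs in a uniform placeholder where a trained model would supply $p$.

```python
import math

def skipgram_pairs(words, k):
    """Return (center, context) pairs for every position t and offset j in [-k, k], j != 0."""
    pairs = []
    for t, center in enumerate(words):
        for j in range(-k, k + 1):
            if j == 0 or not 0 <= t + j < len(words):
                continue  # skip the center word itself and out-of-range positions
            pairs.append((center, words[t + j]))
    return pairs

def average_log_likelihood(words, k, log_prob):
    """(1/T) * sum over all pairs of log p(context | center)."""
    pairs = skipgram_pairs(words, k)
    return sum(log_prob(context, center) for center, context in pairs) / len(words)

sentence = ["the", "quick", "brown", "fox"]
pairs = skipgram_pairs(sentence, k=1)

def uniform_log_prob(context, center):
    # Placeholder model: every vocabulary word is equally likely.
    return math.log(1.0 / len(set(sentence)))

avg_ll = average_log_likelihood(sentence, 1, uniform_log_prob)
print(pairs)
print(avg_ll)
```

Training amounts to adjusting the word vectors so that this average rises above what the uniform placeholder achieves.
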
In the skip-gram model, every word $w$ is associated with two vectors, $u_w$ and $v_w$, which are the vector representations of $w$ as a word and as a context, respectively. The probability of correctly predicting word $w_i$ given word $w_j$ is determined by the softmax model, which is
$$p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})}$$
where $V$ is the vocabulary size.
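
As an illustration of the softmax formula, the sketch below computes $p(w_i | w_j)$ for a toy vocabulary of three words with two-dimensional vectors. The vectors are made-up numbers and the function name is ours; the only point is to show that each probability requires a pass over all $V$ words in the denominator.

```python
import math

def softmax_prob(i, j, U, Vc):
    """p(w_i | w_j) = exp(u_i . v_j) / sum_l exp(u_l . v_j)."""
    v_j = Vc[j]
    # One dot product per vocabulary word: this is the O(V) cost.
    scores = [sum(a * b for a, b in zip(u_l, v_j)) for u_l in U]
    m = max(scores)  # subtract the max for numerical stability; the ratio is unchanged
    exps = [math.exp(s - m) for s in scores]
    return exps[i] / sum(exps)

# Toy vectors (illustrative values only): U[l] plays the role of u_l, Vc[l] of v_l.
U = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.1]]
Vc = [[0.3, 0.1], [-0.2, 0.4], [0.0, 0.0]]

probs_given_w0 = [softmax_prob(i, 0, U, Vc) for i in range(len(U))]
print(probs_given_w0, sum(probs_given_w0))
```
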
The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$ is proportional to $V$, which can easily be on the order of millions. To speed up the training of Word2Vec, we used hierarchical softmax, which reduces the complexity of computing $\log p(w_i | w_j)$ to $O(\log(V))$.
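
The idea behind hierarchical softmax can be sketched in a few lines of Python. This is a generic illustration, not H2O's implementation: it assumes the vocabulary is arranged in a Huffman tree built from word frequencies (as in the original word2vec), gives every inner node its own vector, and computes a word's probability as a product of sigmoid decisions along the path from the root to that word. All names, frequencies, and vector values are made up.

```python
import heapq
import itertools
import math

def huffman_paths(freqs):
    """Return {word: [(inner_node_id, bit), ...]} for each root-to-leaf path."""
    tie = itertools.count()  # tie-breaker so the heap never compares node payloads
    heap = [(f, next(tie), ("leaf", w)) for w, f in freqs.items()]
    heapq.heapify(heap)
    inner_ids = itertools.count()
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        node = ("inner", next(inner_ids), left, right)
        heapq.heappush(heap, (f1 + f2, next(tie), node))
    paths = {}

    def walk(node, path):
        if node[0] == "leaf":
            paths[node[1]] = path
        else:
            _, node_id, left, right = node
            walk(left, path + [(node_id, 0)])
            walk(right, path + [(node_id, 1)])

    walk(heap[0][2], [])
    return paths

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_prob(word, v_in, paths, node_vecs):
    """p(word | input vector) as a product of left/right decisions on the word's path."""
    p = 1.0
    for node_id, bit in paths[word]:
        score = sum(a * b for a, b in zip(node_vecs[node_id], v_in))
        # sigmoid(score) + sigmoid(-score) == 1, so each inner node splits its mass in two
        p *= sigmoid(score) if bit == 0 else sigmoid(-score)
    return p

freqs = {"the": 50, "fox": 10, "quick": 8, "brown": 7, "jumps": 3}
paths = huffman_paths(freqs)
# One illustrative vector per inner node (a tree with V leaves has V - 1 inner nodes).
node_vecs = {n: [0.1 * (n + 1), -0.2 * n] for n in range(len(freqs) - 1)}
v_in = [0.5, -1.0]

total = sum(hs_prob(w, v_in, paths, node_vecs) for w in freqs)
print({w: len(p) for w, p in paths.items()}, total)
```

Each probability touches only the inner nodes on one path, so the work per word grows with the tree depth rather than with $V$, and the probabilities over all words still sum to 1. Frequent words such as "the" sit closest to the root and are the cheapest to evaluate.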

Tverberg Release (H2O 3.10.3.4)

Below is a detailed list of all the items that are part of the Tverberg release.
List of New Features:
PUBDEV-2058 – Implement word2vec in h2o (To use this feature in R, please visit this demo)
PUBDEV-3635 – Ability to Select Columns for PDP computation in Flow (With this enhancement, users can select which features/columns to render Partial Dependence Plots for in Flow; this is already supported in R and Python. Known issue PUBDEV-3782: when nbins < categorical levels, PDP won’t compute. Please also see this post.)
PUBDEV-3881 – Add PCA Estimator documentation to Python API Docs
PUBDEV-3902 – Documentation: Add information about Azure support to H2O User Guide (Beta)
PUBDEV-3739 – StackedEnsemble: put ensemble creation into the back end.

List of Improvements:
PUBDEV-3989 – Decrease size of h2o.jar
PUBDEV-3257 – Documentation: As a K-Means user, I want to be able to better understand the parameters
PUBDEV-3741 – StackedEnsemble: add tests in R and Python to ensure that a StackedEnsemble performs at least as well as the base_models
PUBDEV-3857 – Clean up the generated Python docs
PUBDEV-3895 – Filter H2OFrame on pandas dates and time (python)
PUBDEV-3912 – Provide way to specify context_path via Python/R h2o.init methods
PUBDEV-3933 – Modify gen_R.py for Stacked Ensemble
PUBDEV-3972 – Add Stacked Ensemble code examples to Python docstrings

List of Bugs:
PUBDEV-2464 – Using asfactor() in Python client cannot allocate to a variable
PUBDEV-3111 – R API’s h2o.interaction() does not use destination_frame argument
PUBDEV-3694 – Errors with PCA on wide data for pca_method = GramSVD which is the default
PUBDEV-3742 – StackedEnsemble should work for regression
PUBDEV-3865 – h2o gbm: for an unseen categorical level, discrepancy in predictions when scoring using h2o vs pojo/mojo
PUBDEV-3883 – Negative indexing for H2OFrame is buggy in R API
PUBDEV-3894 – Relational operators don’t work properly with time columns.
PUBDEV-3966 – java.lang.AssertionError when using h2o.makeGLMModel
PUBDEV-3835 – Standard Errors in GLM: calculating and showing specifically when called
PUBDEV-3965 – Importing data in python returns error – TypeError: expected string or bytes-like object
Hotfix: Remove StackedEnsemble from Flow UI. Training is only supported from Python and R interfaces. Viewing is supported in the Flow UI.

List of Tasks:
PUBDEV-3336 – h2o.create_frame(): if randomize=True, value param cannot be used
PUBDEV-3740 – REST: implement simple ensemble generation API
PUBDEV-3843 – Modify R REST API to always return binary data
PUBDEV-3844 – Safe GET calls for POJO/MOJO/genmodel
PUBDEV-3864 – Import files by pattern
PUBDEV-3884 – StackedEnsemble: Add to online documentation
PUBDEV-3940 – Add Stacked Ensemble code examples to R docs
Download here: http://h2o-release.s3.amazonaws.com/h2o/rel-tverberg/4/index.html

