July 9th, 2015

KFold Cross Validation With H2O-3 and R

This blog also explains the solution to a Google Stream question we received.

Note: KFold Cross Validation will be added to H2O-3 as an argument soon

This is a terse guide to building KFold cross-validated models with H2O using the R interface. Not much R code is needed to get up and running, but it’s by no means a one-magic-button method either. This guide is intended for the more “rustic” data scientist who likes to get their hands a bit dirty and build out their own tools.

In about 30 lines of R you’ll be able to build the folds, the models, and the predictions! Here’s the code in all of its glory:

“Rustic” KFold Cross-Validation Code

h2o.kfold <- function(k,training_frame,X,Y,algo.fun,predict.fun,poll=FALSE) {
  # assign each row a fold ID in 1..k
  folds <- 1+as.numeric(cut(h2o.runif(training_frame), seq(0,1,1/k), include.lowest=TRUE))
  # launch models
  model.futures <- NULL
  for( i in 1L:k) {
    train <- training_frame[folds!=i,]
    if( is.null(model.futures) ) model.futures <- list(algo.fun(train,X,Y))
    else                         model.futures <- c(model.futures, list(algo.fun(train,X,Y)))
  }
  models <- model.futures
  # block on any model futures until the builds finish
  if( poll ) {
    for( i in 1L:length(models) ) {
      models[[i]] <- h2o.getFutureModel(models[[i]])
    }
  }
  # perform predictions on the holdout data
  preds  <- NULL
  for( i in 1L:k) {
    valid <- training_frame[folds==i,]
    p <- predict.fun(models[[i]], valid)
    if( is.null(preds) ) preds <- p
    else                 preds <- h2o.rbind(preds,p)
  }
  # return the results
  list(models=models, predictions=preds)
}

tl;dr: You can start using this right away. Here are three examples:

Example 1: 5-fold GBM

# 5-fold GBM:
h2o_gbm <- function(training_frame,X,Y) {
    h2o.gbm(x=X,
            y=Y,
            training_frame=training_frame,
            ntree=1,
            max_depth=1,
            learn_rate=0.01,
            future=TRUE)  # future = TRUE launches model builds in parallel, careful!
}
kf.gbm <- h2o.kfold(5, fr, X, Y, h2o_gbm, h2o.predict, TRUE)  # poll future models

Example 2: 10-fold Deeplearning

# 10-fold Deeplearning:
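h2o_dl  <- function(training_frame,X,Y){
    h2o.deeplearning(x=X,
                     y=Y,
                     training_frame=training_frame,
                     # (any further arguments are missing from the source)
                     l1=1e-4)         # no future since each DL has high Duty Cycle
}
kf.dl <- h2o.kfold(10, fr, X, Y, h2o_dl , h2o.predict, FALSE)  # no future models to poll!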

Example 3: 3-fold 1-Many Random Forest

# 1-many binomial models with 3-fold cross validation:
rf.one_v_many.futures <- function(training_frame,X,Y) {
  keys.to.clean <- NULL
  nclass <- length(h2o.levels(training_frame[,Y]))
  model.futures <- lapply(0:(nclass-1), function(CLASS) {
    # append a binary response column: is this row in CLASS?
    tr <- h2o.cbind(training_frame, as.factor(as.numeric(training_frame[,Y])==CLASS))
    keys.to.clean <<- c(keys.to.clean, tr@frame_id)
    # the model-building call is missing from the source; a call of this shape is assumed
    h2o.randomForest(x=X, y=ncol(tr), training_frame=tr, future=TRUE)
  })
  # poll the models
  models <- lapply(model.futures, function(MODEL) h2o.getFutureModel(MODEL))
  # some house keeping (cleanup line missing from the source; assumed)
  h2o.rm(keys.to.clean)
  # return the models
  models
}
kf.rf <- h2o.kfold(3,fr,X,Y,rf.one_v_many.futures,ensemble.predict)  # ensemble.predict is below

Diving In

Let’s step through what this h2o.kfold method does.
Admittedly the API here is clunky, but it will certainly do the job. I’ll leave API munging as an exercise for the reader!
Briefly, the parameters are:

  • k: the number of folds
  • training_frame: the dataset to do machine learning on
  • X: predictor variables
  • Y: response variable
  • algo.fun: a function of (training_frame, X, Y) that builds a fully-specified model; this is the algorithm to cross-validate
  • predict.fun: a predict method, called as predict.fun(model, holdout_data)
  • poll: if TRUE, then it will attempt to poll future models

In general, predict.fun should be the vanilla h2o.predict method (although more exotic methods are permissible, as hinted at by Example 3 above).
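To make the algo.fun / predict.fun contract concrete without an H2O cluster, here is a minimal base-R sketch of the same pattern. It is an analogue, not the H2O code: kfold.local and lm_fun are hypothetical names, lm stands in for an H2O algorithm, and mtcars is just a convenient built-in dataset.

```r
# Local analogue of h2o.kfold using base R only (hypothetical helper names).
kfold.local <- function(k, data, X, Y, algo.fun, predict.fun) {
  # assign each row a fold ID in 1..k
  folds <- as.numeric(cut(runif(nrow(data)), seq(0, 1, 1/k), include.lowest=TRUE))
  # one model per fold, each trained on the other k-1 folds
  models <- lapply(1:k, function(i) algo.fun(data[folds != i, ], X, Y))
  # predict each holdout fold with the model that never saw it
  preds <- numeric(nrow(data))
  for (i in 1:k) {
    holdout <- folds == i
    preds[holdout] <- predict.fun(models[[i]], data[holdout, ])
  }
  list(models=models, predictions=preds)
}

# algo.fun takes (training_frame, X, Y) and returns a fitted model
lm_fun <- function(training_frame, X, Y) {
  lm(reformulate(X, response=Y), data=training_frame)
}

set.seed(1)  # arbitrary seed, only so the folds are reproducible
kf <- kfold.local(5, mtcars, c("wt", "hp"), "mpg", lm_fun, predict)
sqrt(mean((kf$predictions - mtcars$mpg)^2))   # cross-validated RMSE
```

One difference worth noting: this sketch writes predictions back into the original row order, whereas h2o.kfold stacks them fold by fold with h2o.rbind, so its prediction rows are grouped by fold.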

How folds are built:

Many of our R examples make use of h2o.runif to split a dataset into (train,valid,test) tuples:

# fr is some existing dataset
r <- h2o.runif(fr)  # builds a vector the length of fr filled with draws from U(0,1)
train <- fr[r < 0.7,]
valid <- fr[r >= 0.7 & r < 0.8, ]  # R has no chained comparisons, so combine with &
test  <- fr[r >= 0.8, ]

We can apply the same thinking to assign fold IDs to each row of our input training data. This is exactly what the first line of h2o.kfold does:

folds <- 1+as.numeric(cut(h2o.runif(training_frame), seq(0,1,1/k), include.lowest=T))

This line performs 3 actions:

1. First it builds a vector filled with uniformly random numbers in [0,1).
2. Next the (extremely useful) `cut` method assigns each random value one of k factor levels.
3. Finally, to get the factor levels as integral identifiers from 1, ..., k we add 1 after coercing the column to numeric (adding 1 because H2O is 0-based).
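The three steps can be tried out locally with a base-R analogue. This is only a sketch: runif stands in for h2o.runif, and n, k and the seed are arbitrary. Note that in base R, as.numeric on a factor already yields 1-based codes, so the +1 from the H2O version is not needed here.

```r
# Base-R analogue of the fold-assignment line (no H2O cluster needed).
set.seed(42)   # arbitrary seed, only for reproducibility
k <- 5         # number of folds
n <- 1000      # hypothetical number of rows

r     <- runif(n)                                      # step 1: draws from U(0,1)
bins  <- cut(r, seq(0, 1, 1/k), include.lowest=TRUE)   # step 2: one of k factor levels
folds <- as.numeric(bins)                              # step 3: integer IDs 1..k

table(folds)   # fold sizes are roughly, not exactly, n/k
```

Because the assignment is random, the folds are only approximately equal in size, which is usually fine for large frames.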

The remainder of the method is not very interesting, except for the asynchronous launch and polling of models. From the R interface, the algorithm methods may take a special parameter future=TRUE, which returns a model future object immediately instead of blocking until the build finishes. The future can be resolved later with h2o.getFutureModel, which is what h2o.kfold does when poll=TRUE.

Predicting 1-Many Models

Building on the one-versus-many code in Example 3, the predict code should look something like this:

ensemble.predict <- function(models,valid_data) {
  probs <- .binomial.predict.helper(models,valid_data)
  p_valid <- h2o:::h2o.which.max(probs[[1]])
  res <- h2o.cbind(p_valid,probs[[1]][,-1])
  res  # return value assumed; the closing lines are missing from the source
}
.binomial.predict.helper <- function(models,data) {
  keys.to.clean <- NULL
  threshes <- NULL
  Y <- ncol(data)  # assumes that response is last vec...
  res <- lapply(0L:(length(models)-1L), function(ID) {
    d <- h2o.cbind(data, as.numeric(data[,Y])==ID)
    p <- h2o.performance(models[[ID+1]], d)
    t <- h2o.find_threshold_by_max_metric(p, "f1")
    pred <- h2o.predict(models[[ID+1]], d)
    cp <- ifelse(pred[,3] >= t, pred[,3], 0)  # keep the class probability only if it clears the threshold
    keys.to.clean <<- c(keys.to.clean, d@frame_id, pred@frame_id)
    threshes <<- c(threshes, t)  # <<- so the update is visible outside the closure
    cp  # assumed: each element of res is the thresholded probability column
  })
  res <- h2o.cbind(res)
  list(res, threshes)  # assumed from the probs[[1]] access above; cleanup and return lines are missing from the source
}

This constructs class probabilities for each of the classes based on a threshold computed over the holdout data from the kfold cross validation (it alternatively takes any input vector of thresholds).
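The thresholding idea can be illustrated on plain R matrices. This is a sketch with made-up numbers, not the H2O code path: probs and threshes are hypothetical stand-ins for the per-class probabilities and the max-F1 thresholds, and max.col plays the role that h2o:::h2o.which.max plays above.

```r
# Hypothetical per-class probabilities for 4 rows and 3 classes (made-up numbers)
probs <- matrix(c(0.9, 0.2, 0.1,
                  0.3, 0.6, 0.2,
                  0.4, 0.4, 0.7,
                  0.2, 0.3, 0.1),
                ncol=3, byrow=TRUE,
                dimnames=list(NULL, c("class0", "class1", "class2")))

# Hypothetical per-class thresholds (in the helper above these come from max F1)
threshes <- c(0.5, 0.5, 0.5)

# zero out any probability that falls below its class threshold
cp <- sweep(probs, 2, threshes, function(p, t) ifelse(p >= t, p, 0))

# predicted class = column with the largest surviving probability (0-based label)
predicted <- max.col(cp, ties.method="first") - 1L
predicted
```

In the last row no class clears its threshold, so every entry is zeroed and the tie falls to the first class. A real ensemble would likely want an explicit rule for that case.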
