July 9th, 2015

KFold Cross Validation With H2O-3 and R

This blog post also explains the solution to a Google Stream question we received.

Note: KFold Cross Validation will be added to H2O-3 as an argument soon

This is a terse guide to building KFold cross-validated models with H2O using the R interface. There's not much R code needed to get up and running, but it's by no means a one-magic-button method either. This guide is intended for the more “rustic” data scientist who likes to get their hands a bit dirty and build out their own tools.

In about 30 lines of R you'll be able to build the folds, the models, and the predictions! Here's the code in all of its glory:

“Rustic” KFold Cross-Validation Code

h2o.kfold <- function(k,training_frame,X,Y,algo.fun,predict.fun,poll=FALSE) {
  folds <- 1+as.numeric(cut(h2o.runif(training_frame), seq(0,1,1/k), include.lowest=T))
  # launch models: model i is trained on every row outside fold i
  model.futures <- NULL
  for( i in 1L:k) {
    train <- training_frame[folds!=i,]
    if( is.null(model.futures) ) model.futures <- list(algo.fun(train,X,Y))
    else                         model.futures <- c(model.futures, list(algo.fun(train,X,Y)))
  }
  models <- model.futures
  # if the models were launched as futures, block until each one finishes
  if( poll ) {
    for( i in 1L:length(models) ) {
      models[[i]] <- h2o.getFutureModel(models[[i]])
    }
  }
  # perform predictions on the holdout data
  preds  <- NULL
  for( i in 1L:k) {
    valid <- training_frame[folds==i,]
    p <- predict.fun(models[[i]], valid)
    if( is.null(preds) ) preds <- p
    else                 preds <- h2o.rbind(preds,p)
  }
  # return the results
  list(models=models, predictions=preds)
}

tl;dr: You can start using this right away. Here are three examples:

Example 1: 5-fold GBM

# 5-fold GBM:
h2o_gbm <- function(training_frame,X,Y) {
    h2o.gbm(x=X,
            y=Y,
            training_frame=training_frame,
            ntrees=1,
            max_depth=1,
            learn_rate=0.01,
            future=TRUE)  # future = TRUE launches model builds in parallel, careful!
}
kf.gbm <- h2o.kfold(5, fr, X, Y, h2o_gbm, h2o.predict, TRUE)  # poll future models

Example 2: 10-fold Deeplearning

# 10-fold Deeplearning:
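h2o_dl <- function(training_frame,X,Y) {
  # call reconstructed by analogy with Example 1; only the l1 argument survives from the original listing
  h2o.deeplearning(x=X, y=Y, training_frame=training_frame,
                   l1=1e-4)         # no future since each DL has a high duty cycle
}
kf.dl <- h2o.kfold(10, fr, X, Y, h2o_dl, h2o.predict, FALSE)  # no future models to poll!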

Example 3: 3-fold 1-Many Random Forest

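# 1-many binomial models with 3-fold cross validation:
rf.one_v_many.futures <- function(training_frame,X,Y) {
  keys.to.clean <- NULL
  nclass <- length(h2o.levels(training_frame[,Y]))
  model.futures <- lapply(0:(nclass-1), function(CLASS) {
    # append a binary "is this row CLASS?" column as the new response
    tr <- h2o.cbind(training_frame, as.factor(as.numeric(training_frame[,Y])==CLASS))
    keys.to.clean <<- c(keys.to.clean, tr@frame_id)
    # reconstructed line (missing from the listing): launch one RF per class on the binary response
    h2o.randomForest(x=X, y=ncol(tr), training_frame=tr, future=TRUE)
  })
  # poll the models
  models <- lapply(model.futures, function(MODEL) h2o.getFutureModel(MODEL))
  # some house keeping (reconstructed line: drop the temporary frames)
  h2o.rm(keys.to.clean)
  # return the models
  models
}
kf.rf <- h2o.kfold(3,fr,X,Y,rf.one_v_many.futures,ensemble.predict)  # ensemble.predict is below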

Diving In

Let’s step through what this h2o.kfold method does.
Admittedly the API here is clunky, but it will certainly do the job. I'll leave API munging as an exercise for the reader!
Briefly, the parameters are:

  • k: the number of folds
  • training_frame: the dataset to do machine learning on
  • X: predictor variables
  • Y: response variable
  • algo.fun: a fully-specified algorithm to perform kfold cross-validation on
  • predict.fun: a predict method
  • poll: if TRUE, then it will attempt to poll future models

In general, predict.fun should be the vanilla h2o.predict method (although more exotic methods are permissible, as hinted at by Example 3 above).
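
To make the flow concrete without needing an H2O cluster, here is a minimal base-R sketch of the same fold / fit / predict loop. It is an analogue, not the H2O code: `kfold_local` and `fit_lm` are hypothetical names made up for this illustration, the data is R's built-in `mtcars`, and the model is a plain `lm`. One deliberate difference: it writes predictions back in the original row order, whereas h2o.kfold above stacks them fold by fold.

```r
# Base-R analogue of h2o.kfold (illustrative only; runs locally, no H2O needed).
kfold_local <- function(k, data, algo.fun, predict.fun, seed = 1) {
  set.seed(seed)
  # random fold ID in 1..k for every row
  folds <- as.integer(cut(runif(nrow(data)), seq(0, 1, length.out = k + 1),
                          include.lowest = TRUE))
  # model i is trained on every row outside fold i
  models <- lapply(seq_len(k), function(i) algo.fun(data[folds != i, , drop = FALSE]))
  # each held-out fold is scored by the model that never saw it
  preds <- rep(NA_real_, nrow(data))
  for (i in seq_len(k)) {
    held <- folds == i
    if (any(held)) preds[held] <- predict.fun(models[[i]], data[held, , drop = FALSE])
  }
  list(models = models, predictions = preds, folds = folds)
}

# hypothetical "fully-specified algorithm": a linear model on two predictors
fit_lm <- function(train) lm(mpg ~ wt + hp, data = train)

cv   <- kfold_local(4, mtcars, fit_lm, predict)
rmse <- sqrt(mean((mtcars$mpg - cv$predictions)^2))  # cross-validated RMSE
```

Swapping `fit_lm` for another fitting function is the local equivalent of passing a different algo.fun to h2o.kfold.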

How folds are built:

Many of our R examples make use of h2o.runif to split a dataset into (train,valid,test) tuples:

# some existing dataset
r <- h2o.runif(fr)  # builds a vector the length of fr filled with draws from U(0,1)
train <- fr[r < 0.7,]
valid <- fr[r >= 0.7 & r < 0.8, ]
test  <- fr[r >= 0.8, ]
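
The same random-draw split can be tried locally in base R. The sketch below uses `runif` on an ordinary data frame in place of h2o.runif on an H2O frame; the data frame `df` and its columns are made up for the illustration. Because each row is assigned independently, the 70/10/20 proportions are only approximate.

```r
# Base-R version of the random split (illustrative; df is synthetic data).
set.seed(7)
df <- data.frame(x = rnorm(1000), y = rnorm(1000))

r <- runif(nrow(df))               # one U(0,1) draw per row
train <- df[r < 0.7, ]
valid <- df[r >= 0.7 & r < 0.8, ]  # R has no chained comparison, so combine with &
test  <- df[r >= 0.8, ]

# every row lands in exactly one of the three pieces
sizes <- c(train = nrow(train), valid = nrow(valid), test = nrow(test))
sizes / nrow(df)                   # roughly 0.7 / 0.1 / 0.2, not exact
```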

We can apply the same thinking to assign fold IDs to each row of our input training data. This is exactly what the first line of h2o.kfold does:

folds <- 1+as.numeric(cut(h2o.runif(training_frame), seq(0,1,1/k), include.lowest=T))

This line performs 3 actions:

1. First it builds a vector filled with uniformly random numbers in [0,1).
2. Next the (extremely useful) `cut` method assigns each random value one of k factor levels.
3. Finally, to get the factor levels as integral identifiers from 1, ..., k we add 1 after coercing the column to numeric (adding 1 because H2O is 0-based).
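
The three steps are easy to check in base R. This is a local sketch, not the H2O call: `assign_folds` is a hypothetical helper, and `runif` stands in for h2o.runif. Note one difference from the H2O line: base R factor codes already start at 1, so no `1+` is needed here.

```r
# Base-R sketch of the fold assignment (illustrative helper, not part of H2O).
assign_folds <- function(n, k, seed = 42) {
  set.seed(seed)
  u <- runif(n)                             # step 1: uniform random draws
  breaks <- seq(0, 1, length.out = k + 1)   # k equal-width bins on [0,1]
  f <- cut(u, breaks = breaks, include.lowest = TRUE)  # step 2: k factor levels
  as.integer(f)                             # step 3: integer fold IDs 1..k
}

folds <- assign_folds(1000, 5)
table(folds)  # fold sizes are roughly, but not exactly, equal
```

Using `length.out = k + 1` instead of a step of `1/k` is a small design choice that avoids floating-point surprises at the upper break.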

The remainder of the method is not very interesting, except for the asynchronous launch and polling of models. From the R interface, the algorithm methods may take a special parameter future=TRUE to return a model future object, which can be blocked on at a future time (rather than polling at launch).

Predicting 1-Many Models

Building off the one-versus-many code in Example 3, the predict code should look something like:

ensemble.predict <- function(models,valid_data) {
  probs <- .binomial.predict.helper(models,valid_data)
  p_valid <- h2o:::h2o.which.max(probs[[1]])
  res <- h2o.cbind(p_valid,probs[[1]][,-1])
  res
}

.binomial.predict.helper <- function(models,data) {
  keys.to.clean <- NULL
  threshes <- NULL
  Y <- ncol(data)  # assumes that response is last vec...
  res <- lapply(0L:(length(models)-1L), function(ID) {
    d <- h2o.cbind(data, as.numeric(data[,Y])==ID)
    p <- h2o.performance(models[[ID+1]], d)
    t <- h2o.find_threshold_by_max_metric(p, "f1")
    pred <- h2o.predict(models[[ID+1]], d)
    # keep the class probability only if it clears the max-F1 threshold
    cp <- ifelse(pred[,3] >= t, pred[,3], 0)
    keys.to.clean <<- c(keys.to.clean, d@frame_id, pred@frame_id)
    threshes <<- c(threshes, t)
    cp
  })
  res <- h2o.cbind(res)
  # return shape inferred from the probs[[1]] access in ensemble.predict
  list(res, threshes)
}

This constructs class probabilities for each of the classes based on a threshold computed over the holdout data from the kfold cross validation (alternatively, it can take any input vector of thresholds).
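
The threshold-then-pick-the-winner idea can be sketched locally in base R. This is an illustration under assumptions, not the H2O internals: `ensemble_argmax` is a hypothetical helper, the probabilities and thresholds are made-up numbers, and it assumes the final step simply picks the class with the largest surviving probability in each row.

```r
# Base-R sketch of combining one-versus-many probabilities (illustrative only).
ensemble_argmax <- function(probs, thresholds) {
  stopifnot(is.matrix(probs), ncol(probs) == length(thresholds))
  # repeat the per-class thresholds down the rows so shapes line up
  thr <- matrix(thresholds, nrow = nrow(probs), ncol = ncol(probs), byrow = TRUE)
  # zero out any class probability below that class's threshold
  kept <- ifelse(probs >= thr, probs, 0)
  # winning class per row, 0-based like the class IDs in the post;
  # a row where nothing clears its threshold falls back to class 0
  max.col(kept, ties.method = "first") - 1L
}

# made-up probabilities: one row per observation, one column per class
probs <- rbind(c(0.2, 0.7, 0.4),
               c(0.6, 0.1, 0.5),
               c(0.3, 0.3, 0.9))
thresholds <- c(0.5, 0.5, 0.5)
classes <- ensemble_argmax(probs, thresholds)
```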
