KFold Cross Validation With H2O-3 and R

This blog also explains the solution to a Google Stream question we received.

Note: KFold Cross Validation will be added to H2O-3 as an argument soon

This is a terse guide to building KFold cross-validated models with H2O using the R interface. There’s not very much R code needed to get up and running, but it’s by no means the one-magic-button method either. This guide is intended for the more “rustic” data scientist who likes to get their hands a bit dirty and build out their own tools.

In about 30 lines of R you’ll be able to build the folds, the models, and the predictions! Here’s the code in all of its glory:

“Rustic” KFold Cross-Validation Code

h2o.kfold <- function(k,training_frame,X,Y,algo.fun,predict.fun,poll=FALSE) {
  folds <- 1+as.numeric(cut(h2o.runif(training_frame), seq(0,1,1/k), include.lowest=T))
  print(dim(folds))
  # launch models
  model.futures <- NULL
  for( i in 1L:k) {
    train <- training_frame[folds!=i,]
    if( is.null(model.futures) ) model.futures <- list(algo.fun(train,X,Y))
    else model.futures <- c(model.futures, list(algo.fun(train,X,Y)))
  }
  models <- model.futures
  if( poll ) {
    for( i in 1L:length(models) ) {
      models[[i]] <- h2o.getFutureModel(models[[i]])
    }
  }
  # perform predictions on the holdout data
  preds <- NULL
  for( i in 1L:k) {
    valid <- training_frame[folds==i,]
    p <- predict.fun(models[[i]], valid)
    if( is.null(preds) ) preds <- p
    else preds <- h2o.rbind(preds,p)
  }
  # return the results
  list(models=models, predictions=preds)
}

tl;dr : You can start using this right away. Here are three examples:

Example 1: 5-fold GBM

# 5-fold GBM:
h2o_gbm <- function(training_frame,X,Y) {
  h2o.gbm(x=X,
          y=Y,
          training_frame=training_frame,
          ntree=1,
          max_depth=1,
          learn_rate=0.01,
          future=TRUE) # future = TRUE launches model builds in parallel, careful!
}
kf.gbm <- h2o.kfold(5, fr, X, Y, h2o_gbm, h2o.predict, TRUE) # poll future models

Example 2: 10-fold Deeplearning

# 10-fold Deeplearning:
h2o_dl <- function(training_frame,X,Y){
  h2o.deeplearning(x=X,
                   y=Y,
                   training_frame=training_frame,
                   hidden=c(200,200,200),
                   activation="RectifierWithDropout",
                   input_dropout_ratio=0.3,
                   hidden_dropout_ratios=c(0.5,0.5,0.5),
                   l1=1e-4) # no future since each DL has high Duty Cycle
}
kf.dl <- h2o.kfold(10, fr, X, Y, h2o_dl, h2o.predict, FALSE) # no future models to poll!

Example 3: 3-fold 1-Many Random Forest

# 1-many binomial models with 3-fold cross validation:
rf.one_v_many.futures <- function(training_frame,X,Y) {
  keys.to.clean <- NULL
  nclass <- length(h2o.levels(training_frame[,Y]))
  model.futures <- lapply(0:(nclass-1), function(CLASS) {
    # append a binary response column: is this row in CLASS or not?
    tr <- h2o.cbind(training_frame, as.factor(as.numeric(training_frame[,Y])==CLASS))
    keys.to.clean <<- c(keys.to.clean, tr@frame_id)
    h2o.randomForest(x=X,
                     y=ncol(tr),
                     training_frame=tr,
                     ntree=50,
                     max_depth=20,
                     future=TRUE)
  })
  # poll the models
  models <- lapply(model.futures, function(MODEL) h2o.getFutureModel(MODEL))
  # some house keeping
  h2o.rm(keys.to.clean)
  # return the models
  models
}
kf.rf <- h2o.kfold(3,fr,X,Y,rf.one_v_many.futures,ensemble.predict) # ensemble.predict is below

Diving In

Let’s step through what this h2o.kfold method does.
Admittedly the API here is clunky, but it will certainly do the job — I’ll leave API munging as an exercise for the reader!
Briefly the parameters are:

k: the number of folds
training_frame: the dataset to do machine learning on
X: predictor variables
Y: response variable
algo.fun: a fully-specified algorithm to perform kfold cross-validation on
predict.fun: a predict method
poll: if TRUE, then it will attempt to poll future models

In general, predict.fun should be the vanilla h2o.predict method (although more exotic methods are permissible, as hinted at by Example 3 above).
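To make the roles of algo.fun and predict.fun concrete, here is a minimal base-R sketch of the same pattern that runs without an H2O cluster. The names kfold_local and lm_fun are hypothetical, and lm on the built-in mtcars data stands in for an H2O algorithm; treat it as an illustration of the calling convention rather than a replacement for h2o.kfold.

```r
# Hypothetical base-R analogue of h2o.kfold; runs locally, no H2O needed.
kfold_local <- function(k, data, X, Y, algo.fun, predict.fun) {
  # assign each row a fold ID in 1..k from a uniform draw
  folds <- as.integer(cut(runif(nrow(data)),
                          seq(0, 1, length.out = k + 1),
                          include.lowest = TRUE))
  # fit one model per fold, holding that fold out of training
  models <- lapply(seq_len(k), function(i) algo.fun(data[folds != i, ], X, Y))
  # predict each held-out fold with the model that never saw it
  preds <- rep(NA_real_, nrow(data))
  for (i in seq_len(k)) {
    held_out <- data[folds == i, , drop = FALSE]
    preds[folds == i] <- predict.fun(models[[i]], held_out)
  }
  list(models = models, predictions = preds, folds = folds)
}

# Same (training_frame, X, Y) signature as the algo.fun wrappers in the examples.
lm_fun <- function(training_frame, X, Y) {
  lm(reformulate(X, response = Y), data = training_frame)
}

set.seed(1)
kf <- kfold_local(4, mtcars, c("wt", "hp"), "mpg", lm_fun, predict)
rmse <- sqrt(mean((kf$predictions - mtcars$mpg)^2))  # out-of-fold error
```

One design difference: this sketch writes predictions back into their original row positions, whereas h2o.kfold stacks them fold by fold with h2o.rbind, so its predictions come back in fold order rather than row order.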

How folds are built:

Many of our R examples make use of h2o.runif to split a dataset into (train,valid,test) tuples:

# some existing dataset
r <- h2o.runif(fr) # builds a vector the length of fr filled with draws from U(0,1)
train <- fr[r < 0.7,]
valid <- fr[r >= 0.7 & r < 0.8, ] # chained comparisons are not valid R; combine with &
test <- fr[r >= 0.8, ]

We can apply the same thinking to assign fold IDs to each row of our input training data. This is exactly what the first line of h2o.kfold does:

folds <- 1+as.numeric(cut(h2o.runif(training_frame), seq(0,1,1/k), include.lowest=T))

This line performs 3 actions:

1. First it builds a vector filled with uniformly random numbers in [0,1).
2. Next the (extremely useful) `cut` method assigns each random value one of k factor levels.
3. Finally, to get the factor levels as integral identifiers from 1, ..., k we add 1 after coercing the column to numeric (adding 1 because H2O is 0-based).
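The three steps above can be tried locally with base R, using runif in place of h2o.runif. This is only a sketch of the idea: in base R the integer codes of a factor already start at 1, so the "+1" from the H2O version is not needed here.

```r
# Base-R sketch of the fold-assignment idea (no H2O cluster required).
set.seed(42)
k <- 5
n <- 1000

u     <- runif(n)                                        # step 1: draws from U(0,1)
bins  <- cut(u, seq(0, 1, 1/k), include.lowest = TRUE)   # step 2: k factor levels
folds <- as.integer(bins)                                # step 3: integer IDs 1..k

fold_sizes <- table(folds)   # roughly n/k rows per fold, but not exactly equal
```

Because the assignment is random, fold sizes are only approximately balanced. That is usually fine for large frames but is worth keeping in mind for small ones.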

The remainder of the method is not very interesting, except for the asynchronous launch and polling of models. From the R interface, the algorithm methods may take a special parameter future=TRUE to return a model future object, which can be blocked on at a future time (rather than polling at launch).

Predicting 1-Many Models

Building on the one-versus-many code in Example 3, the predict code should look something like this:

ensemble.predict <- function(models,valid_data) {
  probs <- .binomial.predict.helper(models,valid_data)
  p_valid <- h2o:::h2o.which.max(probs[[1]])
  dim(p_valid)
  res <- h2o.cbind(p_valid,probs[[1]][,-1])
  dim(res)
  res
}
.binomial.predict.helper <- function(models,data) {
  keys.to.clean <- NULL
  threshes <- NULL
  Y <- ncol(data) # assumes that response is last vec...
  res <- lapply(0L:(length(models)-1L), function(ID) {
    d <- h2o.cbind(data, as.numeric(data[,Y])==ID)
    p <- h2o.performance(models[[ID+1]], d)
    t <- h2o.find_threshold_by_max_metric(p, "f1")
    pred <- h2o.predict(models[[ID+1]], d)
    cp <- ifelse(pred[,3] >= t, pred[,3], 0)
    keys.to.clean <<- c(keys.to.clean, d@frame_id, pred@frame_id)
    threshes <<- c(threshes, t) # <<- so the thresholds survive outside the closure
    dim(cp)
    cp
  })
  res <- h2o.cbind(res)
  print(dim(res))
  h2o.rm(keys.to.clean)
  list(res,threshes)
}

This constructs class probabilities for each of the classes based on a threshold computed over the holdout data from the kfold cross validation (it alternatively takes any input vector of thresholds).
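The thresholding-then-argmax step can be illustrated on a small local matrix. The probabilities and thresholds below are made up for illustration; in the code above the thresholds come from the max-F1 metric on the holdout data, and the work is done on H2O frames rather than base-R matrices.

```r
# Rows are observations, columns are per-class probabilities from the
# one-versus-many models (illustrative values only).
probs <- matrix(c(0.9, 0.2, 0.1,
                  0.3, 0.6, 0.4,
                  0.2, 0.1, 0.8), nrow = 3, byrow = TRUE)
threshes <- c(0.5, 0.5, 0.5)   # one threshold per class (assumed values)

# Zero out any probability that falls below its class threshold.
kept <- ifelse(sweep(probs, 2, threshes, ">="), probs, 0)

# Pick the surviving class with the highest probability; subtract 1 to
# match the 0-based class IDs used in the H2O code.
pred_class <- max.col(kept, ties.method = "first") - 1L
```

A row whose probabilities all fall below their thresholds ends up as all zeros, and with ties.method = "first" it is assigned class 0; whether that default is appropriate depends on the problem.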

H2O.ai Team
