This blog post also explains the solution to a Google Stream question we received.
This is a terse guide to building k-fold cross-validated models with H2O using the R interface. There’s not very much R code needed to get up and running, but it’s by no means a one-magic-button method either. This guide is intended for the more “rustic” data scientist who likes to get their hands a bit dirty and build out their own tools.
In about 30 lines of R you’ll be able to build the folds, the models, and the predictions! Here’s the code in all of its glory:
h2o.kfold <- function(k, training_frame, X, Y, algo.fun, predict.fun, poll=FALSE) {
  # assign each row a fold id in 1..k
  folds <- 1 + as.numeric(cut(h2o.runif(training_frame), seq(0, 1, 1/k), include.lowest=TRUE))
  print(dim(folds))

  # launch models: one per fold, each trained on the other k-1 folds
  model.futures <- NULL
  for( i in 1L:k ) {
    train <- training_frame[folds != i, ]
    if( is.null(model.futures) ) model.futures <- list(algo.fun(train, X, Y))
    else model.futures <- c(model.futures, list(algo.fun(train, X, Y)))
  }
  models <- model.futures
  if( poll ) {
    for( i in 1L:length(models) ) {
      models[[i]] <- h2o.getFutureModel(models[[i]])  # block until each future model finishes
    }
  }

  # perform predictions on the holdout data
  preds <- NULL
  for( i in 1L:k ) {
    valid <- training_frame[folds == i, ]
    p <- predict.fun(models[[i]], valid)
    if( is.null(preds) ) preds <- p
    else preds <- h2o.rbind(preds, p)
  }

  # return the results
  list(models=models, predictions=preds)
}
tl;dr: You can start using this right away. Here are three examples:
# 5-fold GBM:
h2o_gbm <- function(training_frame, X, Y) {
  h2o.gbm(x=X,
          y=Y,
          training_frame=training_frame,
          ntree=1,
          max_depth=1,
          learn_rate=0.01,
          future=TRUE)  # future=TRUE launches model builds in parallel, careful!
}
kf.gbm <- h2o.kfold(5, fr, X, Y, h2o_gbm, h2o.predict, TRUE) # poll future models
# 10-fold Deep Learning:
h2o_dl <- function(training_frame, X, Y) {
  h2o.deeplearning(x=X,
                   y=Y,
                   training_frame=training_frame,
                   hidden=c(200,200,200),
                   activation="RectifierWithDropout",
                   input_dropout_ratio=0.3,
                   hidden_dropout_ratios=c(0.5,0.5,0.5),
                   l1=1e-4)  # no future=TRUE since each DL build has a high duty cycle
}
kf.dl <- h2o.kfold(10, fr, X, Y, h2o_dl, h2o.predict, FALSE)  # no future models to poll!
# One-versus-many binomial models with 3-fold cross-validation:
rf.one_v_many.futures <- function(training_frame, X, Y) {
  keys.to.clean <- NULL
  nclass <- length(h2o.levels(training_frame[,Y]))
  model.futures <- lapply(0:(nclass-1), function(CLASS) {
    # append a binary response: 1 if the row's class is CLASS, 0 otherwise
    tr <- h2o.cbind(training_frame, as.factor(as.numeric(training_frame[,Y])==CLASS))
    keys.to.clean <<- c(keys.to.clean, tr@frame_id)
    h2o.randomForest(x=X,
                     y=ncol(tr),
                     training_frame=tr,
                     ntree=50,
                     max_depth=20,
                     future=TRUE)
  })
  # poll the models
  models <- lapply(model.futures, function(MODEL) h2o.getFutureModel(MODEL))
  # some housekeeping
  h2o.rm(keys.to.clean)
  # return the models
  models
}
kf.rf <- h2o.kfold(3,fr,X,Y,rf.one_v_many.futures,ensemble.predict) # ensemble.predict is below
Let’s step through what this `h2o.kfold` method does.
Admittedly the API here is clunky, but it will certainly do the job — I’ll leave API munging as an exercise for the reader!
Briefly, the parameters are:

- `k`: the number of folds
- `training_frame`: the dataset to do machine learning on
- `X`: the predictor variables
- `Y`: the response variable
- `algo.fun`: a fully-specified algorithm to perform k-fold cross-validation with
- `predict.fun`: a predict method
- `poll`: if TRUE, then it will attempt to poll future models

In general, `predict.fun` should be the vanilla `h2o.predict` method (although more exotic methods are permissible, as hinted at by Example 3 above).
Many of our R examples make use of `h2o.runif` to split a dataset into (train, valid, test) tuples:
# some existing dataset
r <- h2o.runif(fr) # builds a vector the length of fr filled with draws from U(0,1)
train <- fr[r < 0.7,]
valid <- fr[r >= 0.7 & r < 0.8, ]
test <- fr[r >= 0.8, ]
We can apply the same thinking to assign fold IDs to each row of our input training data. This is exactly what the first line of `h2o.kfold` does:
folds <- 1 + as.numeric(cut(h2o.runif(training_frame), seq(0, 1, 1/k), include.lowest=TRUE))
This line performs 3 actions:
1. First it builds a vector filled with uniformly random numbers in [0,1).
2. Next the (extremely useful) `cut` method assigns each random value one of k factor levels.
3. Finally, to get the factor levels as integer identifiers 1, ..., k, we coerce the column to numeric and add 1 (the +1 is needed because H2O factor codes are 0-based).
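If you want to convince yourself of what `cut` is doing, the same binning can be sketched with plain base R vectors (illustrative only; note that base R factor codes are already 1-based, so the +1 is only needed on the H2O side):

# Base R illustration of the fold assignment (local vectors, not H2OFrames)
set.seed(42)
k <- 5
r <- runif(10)  # stand-in for h2o.runif(training_frame)
folds <- as.numeric(cut(r, seq(0, 1, 1/k), include.lowest=TRUE))
folds         # a fold id in 1..k for each row
table(folds)  # roughly 1/k of the rows land in each fold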
The remainder of the method is not very interesting, except for the asynchronous launch and polling of models. From the R interface, the algorithm methods may take a special parameter `future=TRUE` to return a model future object, which can be blocked on at a later time with `h2o.getFutureModel` (rather than polling at launch).
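Stripped of the k-fold loop, the launch-then-block pattern looks roughly like this (a sketch assuming an h2o R package version that supports `future=TRUE`, as the examples above do; `train1` and `train2` are placeholder frames):

# Launch two builds without blocking, then block on each future model
f1 <- h2o.gbm(x=X, y=Y, training_frame=train1, future=TRUE)  # returns immediately with a future
f2 <- h2o.gbm(x=X, y=Y, training_frame=train2, future=TRUE)
m1 <- h2o.getFutureModel(f1)  # blocks until the first build completes
m2 <- h2o.getFutureModel(f2)  # blocks until the second build completes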
Building off of the one-versus-many code in Example 3, the predict code should look something like this:
ensemble.predict <- function(models, valid_data) {
  probs <- .binomial.predict.helper(models, valid_data)
  p_valid <- h2o:::h2o.which.max(probs[[1]])  # index of the most probable class per row
  dim(p_valid)
  res <- h2o.cbind(p_valid, probs[[1]][,-1])
  dim(res)
  res
}
.binomial.predict.helper <- function(models, data) {
  keys.to.clean <- NULL
  threshes <- NULL
  Y <- ncol(data)  # assumes that the response is the last vec...
  res <- lapply(0L:(length(models)-1L), function(ID) {
    d <- h2o.cbind(data, as.numeric(data[,Y])==ID)  # append the 0/1 response for class ID
    p <- h2o.performance(models[[ID+1]], d)
    t <- h2o.find_threshold_by_max_metric(p, "f1")  # per-class F1-maximizing threshold
    pred <- h2o.predict(models[[ID+1]], d)
    cp <- ifelse(pred[,3] >= t, pred[,3], 0)  # zero out probabilities below the threshold
    keys.to.clean <<- c(keys.to.clean, d@frame_id, pred@frame_id)
    threshes <<- c(threshes, t)  # <<- so the thresholds survive outside the lapply
    dim(cp)
    cp
  })
  res <- h2o.cbind(res)
  print(dim(res))
  h2o.rm(keys.to.clean)
  list(res, threshes)
}
This constructs class probabilities for each of the classes based on a threshold computed over the holdout data from the k-fold cross-validation (it could alternatively take any input vector of thresholds).
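The thresholding step in isolation looks like this (plain vectors with made-up values, purely to illustrate the `ifelse` above):

# Per-class thresholding sketch with made-up values
p_class <- c(0.10, 0.62, 0.45, 0.91)   # P(class) for four holdout rows
thresh  <- 0.5                         # e.g. the F1-maximizing threshold from h2o.performance
ifelse(p_class >= thresh, p_class, 0)  # keep confident probabilities, zero out the rest
# 0.00 0.62 0.00 0.91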