H2O.ai WIKI

Cross-Validation

What is Cross-Validation?

Cross-validation is a method for estimating how accurately a model will predict data it has not seen. It is used to evaluate and compare learning algorithms and to tune model hyperparameters. In its simplest form, the data set is divided into two segments: one used to train the model and the other used to validate it. A model is considered accurate if its predictions on the validation set are close to the known values.

 

Why is Cross-Validation Important? 

Cross-validation is used to assess the accuracy of a model. To evaluate the performance of any machine learning model, it must be tested on unseen data. Based on its performance on the unseen data, the model can be identified as under-fit, over-fit, or well-fit.

 

Over-fitting occurs when a model performs well on the training data but noticeably worse on unseen data. Under-fitting refers to a model that neither performs well on the training data nor generalizes to new data. A model is well-fit when it performs comparably well on both the training data and unseen data.
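As a rough illustration of comparing training and validation error, the sketch below uses a one-nearest-neighbor lookup as a stand-in model that memorizes its training rows. The data values and helper names are made up for this example, and a gap between the two errors is only a signal of over-fitting, not proof.

```python
from statistics import mean

def nearest_neighbor_predict(train_rows, x):
    """Predict the output of the training row whose input is closest to x."""
    return min(train_rows, key=lambda row: abs(row[0] - x))[1]

def mean_squared_error(train_rows, rows):
    """Average squared error of the nearest-neighbor model on `rows`."""
    return mean((y - nearest_neighbor_predict(train_rows, x)) ** 2 for x, y in rows)

# Hypothetical (input, output) pairs.
train = [(0.0, 1.0), (1.0, 3.2), (2.0, 4.9), (3.0, 7.1), (4.0, 9.0)]
valid = [(0.5, 2.0), (1.5, 4.0), (2.5, 6.0), (3.5, 8.0)]

train_error = mean_squared_error(train, train)  # the model has memorized these rows
valid_error = mean_squared_error(train, valid)  # rows the model has never seen

print("training error:  ", train_error)
print("validation error:", valid_error)
```

A training error of zero alongside a clearly larger validation error is the pattern described above as over-fitting.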

 

Types of Cross-Validation

There are several techniques for performing cross-validation. The following are the most commonly used methods.

k-fold Cross-Validation

k-fold cross-validation is a technique used to evaluate ML models on a limited data sample. The parameter k is the number of folds the dataset is split into. This method is the most widely used because it generally yields a less biased, less optimistic estimate of the model's accuracy than a single train/test split.

The procedure is as follows: 

  1. Shuffle the dataset randomly

  2. Split the dataset into k groups. The value of k can be calculated or chosen.

  3. For each unique group: 

    1. Take the group as a holdout or test data set 

    2. Take the remaining groups as a training data set 

    3. Fit the model on the training data set and evaluate it on the test set 

    4. Retain the evaluation score and discard the model 

  4. Summarize the skill of the model using the sample of model evaluation scores 

The results of a k-fold cross-validation run are summarized with the mean of the model skill scores. This averaged score serves as the estimate of the model's accuracy on unseen data.
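The procedure above can be sketched in plain Python. This is a minimal illustration, not H2O's implementation: the "model" is a hypothetical stand-in that always predicts the training mean, the score is mean squared error, and all function names are invented for the example.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]

def fit_mean_model(y_train):
    """Hypothetical stand-in model: always predicts the training mean."""
    return mean(y_train)

def mean_squared_error(prediction, y_true):
    return mean((y - prediction) ** 2 for y in y_true)

def k_fold_score(y, k=5, seed=0):
    scores = []
    for held_out in k_fold_indices(len(y), k, seed):
        held_out_set = set(held_out)
        y_train = [y[i] for i in range(len(y)) if i not in held_out_set]
        y_test = [y[i] for i in held_out]
        model = fit_mean_model(y_train)                   # fit on the other k-1 folds
        scores.append(mean_squared_error(model, y_test))  # evaluate on the holdout fold
        # The fitted model is discarded here; only its score is kept.
    return mean(scores), scores

y = [float(v) for v in range(20)]  # made-up target values
avg_score, fold_scores = k_fold_score(y, k=5)
print("per-fold scores:", fold_scores)
print("mean score:", avg_score)
```

Fixing the seed keeps the shuffle reproducible, which makes scores comparable between runs.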

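Nested k-fold cross-validation is an approach to overcoming the optimistic bias that arises when the same folds are used both to tune hyperparameters and to estimate performance. It nests the hyperparameter optimization procedure (an inner loop) under the model selection procedure (an outer loop), so each outer test fold is never seen during tuning.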

The use of two cross-validation loops is also known as double cross-validation.
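A minimal sketch of the two loops, under stated assumptions: the model is a hypothetical "shrunken mean" predictor whose single hyperparameter `alpha` scales the training mean, folds are interleaved without shuffling for brevity, and the data are made up.

```python
from statistics import mean

def folds(indices, k):
    """Split a list of indices into k interleaved folds (no shuffling, for brevity)."""
    return [indices[i::k] for i in range(k)]

def cv_error(y, indices, k, alpha):
    """Mean squared error of the shrunken-mean model across k folds of `indices`."""
    errors = []
    for held_out in folds(indices, k):
        train = [i for i in indices if i not in held_out]
        prediction = alpha * mean(y[i] for i in train)
        errors.append(mean((y[i] - prediction) ** 2 for i in held_out))
    return mean(errors)

def nested_cv(y, alphas, outer_k=4, inner_k=3):
    all_indices = list(range(len(y)))
    outer_scores = []
    for held_out in folds(all_indices, outer_k):
        train = [i for i in all_indices if i not in held_out]
        # Inner loop: choose the hyperparameter using the outer training rows only.
        best_alpha = min(alphas, key=lambda a: cv_error(y, train, inner_k, a))
        # Outer loop: score the tuned model on rows the tuning never saw.
        prediction = best_alpha * mean(y[i] for i in train)
        outer_scores.append(mean((y[i] - prediction) ** 2 for i in held_out))
    return mean(outer_scores), outer_scores

y = [10.0, 12.0, 9.0, 11.0, 10.5, 9.5, 11.5, 10.0, 12.5, 8.5, 10.0, 11.0]
estimate, per_fold = nested_cv(y, alphas=[0.5, 0.9, 1.0, 1.1])
print("outer-fold scores:", per_fold)
print("performance estimate:", estimate)
```

Note that a different `alpha` may be selected in each outer fold; the outer estimate describes the whole tuning procedure rather than one fixed configuration.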

Train-Test Split Approach

A train-test split is a model validation procedure that estimates how a model performs on new, unseen data. As previously mentioned, datasets can be split into training and testing data. The training set contains known outputs and is used to train the model. The test set is used to validate the model’s predictions after training. Here is how the procedure works:

  1. Arrange the Data

    1. Make sure your data is arranged into a format acceptable for train test split. 

  2. Split the Data

    1. Split the data set into two pieces: a training set and a testing set. This is done by random sampling without replacement, with the majority of rows used as the training set and the minority as the testing set. A common rule of thumb is a 75/25 split.

  3. Train the Model 

    1. Train the model on the training set. 

  4. Test the Model 

    1. Test the model on the testing set and evaluate its predictions against the known outputs.
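The four steps can be sketched as follows. This is an illustrative example only: `split_train_test` and `fit_line` are hypothetical helpers written for this page, the model is a simple least-squares line, and the data are made up.

```python
import random
from statistics import mean

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Randomly split rows, without replacement, into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x = mean(x for x, _ in rows)
    mean_y = mean(y for _, y in rows)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return slope, mean_y - slope * mean_x

def mean_squared_error(model, rows):
    slope, intercept = model
    return mean((y - (slope * x + intercept)) ** 2 for x, y in rows)

# 1. Arrange the data as (input, known output) pairs; the values are made up.
rows = [(x, 2.0 * x + 1.0 + (0.5 if x % 2 else -0.5)) for x in range(20)]
# 2. Split the data 75/25.
train, test = split_train_test(rows, test_fraction=0.25)
# 3. Train the model on the training set only.
model = fit_line(train)
# 4. Test the model on the held-out testing set.
test_mse = mean_squared_error(model, test)
print(len(train), "training rows,", len(test), "testing rows")
print("test MSE:", test_mse)
```

Because the estimate comes from a single split, it can change noticeably with a different seed, which is one motivation for the k-fold procedure.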

Leave-one-out cross-validation (LOOCV)

Leave-one-out cross-validation, or LOOCV, is used to estimate the performance of a model's predictions. This procedure is most appropriate for small datasets, because it is computationally expensive. LOOCV is a configuration of k-fold cross-validation where k is set to the number of examples in the dataset. It requires one model to be created and evaluated for each example in the dataset. This provides a robust estimate of model performance, as each row of data serves as the test set exactly once. Once models have been evaluated using LOOCV and a final model and configuration are chosen, the model is fit on all available data and then used to make predictions on new data.
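A minimal sketch of LOOCV, assuming a hypothetical stand-in model that predicts the mean of the training rows and using made-up values:

```python
from statistics import mean

def loocv_errors(y):
    """Hold out each row once, fit on the rest, and record the squared error."""
    errors = []
    for i in range(len(y)):
        y_train = y[:i] + y[i + 1:]   # every row except row i
        prediction = mean(y_train)    # stand-in model: the training mean
        errors.append((y[i] - prediction) ** 2)
    return errors

y = [3.0, 5.0, 4.0, 6.0, 2.0]
errors = loocv_errors(y)
print("models fitted:", len(errors))
print("LOOCV mean squared error:", mean(errors))
```

One model is fitted per row, so the cost grows linearly with the size of the dataset.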

Leave-p-out cross-validation (LPOCV)

In this approach, p data points are left out of the training data. If there are a total of n data points in the original input dataset, then n - p data points are used as the training set and the remaining p data points as the validation set. This process is repeated for every possible combination of p data points, and the average error is calculated to determine the effectiveness of the model.
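A sketch of leave-p-out, again assuming a hypothetical mean-predicting stand-in model and made-up values. It enumerates every way of holding out p rows with `itertools.combinations`.

```python
from itertools import combinations
from math import comb
from statistics import mean

def leave_p_out_error(y, p):
    """Average validation error over every way of holding out p rows."""
    n = len(y)
    errors = []
    for held_out in combinations(range(n), p):
        y_train = [y[i] for i in range(n) if i not in held_out]  # n - p rows
        prediction = mean(y_train)                               # stand-in model
        errors.append(mean((y[i] - prediction) ** 2 for i in held_out))
    return mean(errors), len(errors)

y = [3.0, 5.0, 4.0, 6.0, 2.0, 7.0]
avg_error, n_splits = leave_p_out_error(y, p=2)
print("splits evaluated:", n_splits, "of", comb(len(y), 2), "possible")
print("average error:", avg_error)
```

The number of splits is "n choose p", which grows very quickly, so exhaustive leave-p-out is usually practical only for small n and small p.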

 

Cross-Validation Considerations

Some cross-validation techniques may be more appropriate in certain situations than others. In choosing a specific cross-validation procedure, one should consider both costs (e.g., inefficient use of available data in estimating regression parameters) and benefits (e.g., accuracy in estimation).

 

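In complex machine learning pipelines, the same data can inadvertently be used in more than one step, for example when preprocessing statistics are computed on rows that are later used for validation. This is commonly called data leakage, and it can make performance estimates look better than they are. Selecting the proper method and keeping training and validation data separate at every step helps ensure that errors are found and corrected, giving a more accurate result.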
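One common way the same data ends up in two steps is feature scaling. The sketch below, with made-up values and hypothetical helper names, contrasts fitting a scaler on the training rows only with the anti-pattern of fitting it on all rows.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the centering and scaling statistics from the given values."""
    return mean(values), pstdev(values)

def transform(values, scaler):
    center, scale = scaler
    return [(v - center) / scale for v in values]

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Correct use: statistics come from the training rows only.
scaler = fit_scaler(train)
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)

# Anti-pattern: the test rows influence the statistics used in training.
leaky_scaler = fit_scaler(train + test)

print("scaler fitted on train only:", scaler)
print("scaler fitted on all rows:  ", leaky_scaler)
```

Within cross-validation, the same rule applies per fold: any statistic learned from data should be recomputed from that fold's training rows.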

 

Cross-Validation Resources with H2O

H2O Docs: Cross Validation