
ON DEMAND

Accuracy Masterclass Part 2 - Validation Scheme Best Practices

Setting up a validation strategy is one of the most crucial steps in creating a machine learning model. A poorly designed validation scheme can lead to a major overestimation of model accuracy, or even a completely erroneous model. This session covers the common concepts behind setting up a validation strategy and the typical mistakes to avoid.

3 Main Learning Points

  • Learn why it is important to do validation for machine learning 

  • Get an overview of the common validation methods

  • See examples of good and bad validation approaches

Transcript

I'm going to talk about validation schemes as part of our second accuracy masterclass session. We'll start with some basic key concepts and some basic notation I will use across the slides, and then we'll dive deeper into validation schemes: what types of validation are typically described in the literature, and what schemes are typically used in practice, based on my experience. Then we'll slowly move into more practical, informal things that are quite popular out there, like data leaks and why they matter: what do we call a data leak? I'll give you a couple of practical examples, some from the literature, but mostly I will focus on more practical examples from Kaggle competitions. In the end, I'll try to cover some more comprehensive topics related to ensembling and more comprehensive validation schemes. Let's start with some key concepts, some basics. I will introduce some simple notation and some terms I'm going to be using across the presentation, just for every one of us to be on the same page.

 

Today, we'll be talking about models, metrics, datasets, and evaluating the models. Let's start with the models. The validation topics are relevant for pretty much all the models out there; I will focus on supervised machine learning models. That includes typical tabular models: linear regression, generalized linear models, trees, GBMs, random forests, and many others. Neural networks are also heavily dependent on properly validating them and properly setting up the validation schemes. We'll also touch a little bit on time series models, so forecasting models, and a few specific issues that might arise when you're training one. And it can be applied to more models out there. But I would like to stress that even such basic concepts, which are usually covered by every single machine learning book, are very important; they were important before, and they're still extremely important. Even if you're doing a lot of deep learning, the basic principles still hold, and the topic is still very much worth being aware of and mastering. The second notion is a dataset. It's a crucial one. Usually in the literature, we describe a dataset as a set of pairs, x and y, where x is the predictor and y is the target variable. And our model is just a function which tries to predict or reconstruct the target y given x as closely as possible.
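As a tiny illustrative sketch (toy numbers, not from the talk), the three notions fit together like this: a dataset of (x, y) pairs, a model as a function of x, and evaluation as the average of a per-record metric over a specific dataset:

```python
import numpy as np

# A dataset is a set of (x, y) pairs: predictors x and target y.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

# A model is just a function that tries to reconstruct y from x.
def model(X):
    return 0.5 * X[:, 0] + 0.1 * X[:, 1]   # toy "model", purely illustrative

# A metric compares predictions with the true targets; evaluating a model
# on a dataset is almost always just the average of a per-record score.
def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

print(mean_absolute_error(y, model(X)))
```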

 

In order to assess how well our model is doing, we usually introduce metrics, and we had a previous session focused on different types of metrics depending on the type of the model. Here, I will not focus on that too much. I will just say that the metric is always something you focus on when you are training a model, and the metric is basically the function you will use to assess and pick the best model, or judge which one is better, which one is more performant, out of the models you have to choose from.

Usually, it is described as a function of two values, y and the prediction of the model, which describes how well the prediction matches the reality, so how close the prediction is to the true target value. But of course, looking at a single value doesn't make much sense; it can be very, very unstable. So whenever we talk about evaluating the model, it's actually putting together three terms: the model, the metric, and the dataset. Model evaluation is never abstract. It is always connected to a dataset, and a model that performs very well on one dataset might perform really poorly on another one. So whenever I'm talking about model evaluation, I'll be stressing that the model is evaluated against a metric on a specific dataset. And usually, I think, almost all the metrics are just averages of the metric values over all the records in the dataset. Moving on to model hyperparameters, it's a very commonly used term out there. I believe there is no strict definition of what a hyperparameter is, so I will be using this term in the sense that model hyperparameters are a set of parameters that define a family of models. Whenever we talk about a specific type of model, we will also introduce a set of parameters which define how the model is going to look; some of them will drive the model complexity, some of them will not, and the choice of the hyperparameters is quite judgmental, too, because some of the things might be fixed given the task, or given the type of the model, or by choice of the data scientist. Depending on the type of the model, hyperparameters might be very different. For a linear model, usually, if it's a generalized linear model, you can pick a link function, you can decide to fit or not to fit an intercept, you can choose the optimization method if it's, say, logistic regression, and decide the number of steps of the optimization routine, or, more commonly, pick the regularization parameters: you can choose to go for L1 or L2 regularization, or both. And then, depending on the values of the regularization parameters, you will be getting different models using the same training dataset. For trees, there are way more obvious parameters: the size of the leaf, how you choose the features (whether you use all of them or a random subsample every time you create a split), the criterion to split, and so forth. There are so many of them. But when we move to GBMs or random forests, we actually add even more, because we're building a forest, so on top of all the tree parameters we can introduce way more hyperparameters here. And with neural networks, it gets even more overwhelming. It is largely dependent on the network structure: if it's fully connected layers, then it can be just the network depth and width, but if it's a recurrent neural network or some convolutional neural network with custom structure, you can parameterize it and use some parameters describing the structure of the neural network as hyperparameters. And there are a few more typical ones, like the learning rate, the learning-rate schedule, the optimization method, and the number of epochs. But a very important point to keep in mind here is that the model itself is not all of it. Usually, what is supposed to be treated as a part of the model is the whole pipeline of preparing features.
So things like feature selection, feature transformations, and feature engineering could, and typically should, be part of the model and sometimes treated as hyperparameters. We can introduce an automated feature selection routine which takes a couple of parameters as input, and use those as an extended set of the model hyperparameters. And it's a very valid approach to follow.
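A minimal sketch of that idea, assuming scikit-learn (the parameter values and dataset are illustrative): the feature-preparation steps are wrapped into a pipeline, so their settings are tuned as ordinary hyperparameters of the whole model.

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=50, n_informative=10, random_state=0)

# Feature scaling and automated feature selection are treated as part of the model,
# so their settings become ordinary hyperparameters of the whole pipeline.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression)),
    ("model", Ridge()),
])

param_grid = {
    "select__k": [5, 10, 20],          # the feature-selection routine's parameter
    "model__alpha": [0.1, 1.0, 10.0],  # L2 regularization strength
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```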

 

Now a little bit more about validation. That's a very classical picture to keep in mind. Whenever we start working on a problem, we get a dataset, and we usually take a pair of scissors and start cutting it into pieces. Typically, we define three sub-datasets: training, validation, and test (or holdout), and usually they're used for three different purposes. The training part of the original full dataset is used to run the optimization routine to build a model given the hyperparameter set. So if we're training logistic regression, we're using it to run the optimization routine to find the optimal parameters of the linear model. If we are training a decision tree, then we're growing a tree: we use the training dataset, given the hyperparameters which define the depth of the tree, and we build the structure of the tree, and that gives us the model. If we talk about neural networks, then, given the hyperparameter values, we define the structure and use training to optimize the model weights on the training sample. So training data, given the hyperparameter values, will give us a model, a single model. The validation dataset is used to find the best model: given a set of models we produce out of the family, we usually choose the best one, which is the model with the lowest error out there. And last but not least, the test dataset, or test sub-dataset as we usually define it on our own, is another dataset we use to evaluate the chosen model. Given the final choice of the model, we apply the model to this dataset in order to know how accurate it is. And there is a commonly used alternative name for the test dataset, holdout, to emphasize the fact that we put it to the side and never touch it until the very final model is ready, and we use it only to assess how accurate the model is. That poses a question: why so much work? Why not use the entire dataset? Why do we need three? A typical picture is this one; I'm basically copying a picture from The Elements of Statistical Learning. What would happen if we did not use a holdout sample, did not use a test sample, and just relied on the training sample to fit a model, measure the accuracy of the model, and pick the best model? The issue might arise, and typically it does arise, when we increase the model complexity. In the extreme case of modern models, such as GBMs or neural networks, that have a lot of parameters and lots of degrees of freedom, especially neural networks, which nowadays can have billions of parameters to tune, they have very high complexity. That gives them the ability pretty much to overfit and, to some extent, memorize the training sample, meaning that the longer we train them, the better the error will be on training data, always. But if we have a test dataset out there, then we will capture the point where the so-called overfitting happens; that's the point where the model actually stops performing better on outside data and just starts tuning itself to the training data rather than generalizing to the general population of future records. In order to capture this overfitting, in order to find the best model out there instead of training endlessly, this notion of a test sample was introduced. With a test sample, we can find the optimum; here, we see it somewhere in the middle of the picture.
So the optimal model complexity would be the one that gives us the lowest test error.
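A minimal sketch of the three-way split just described, assuming scikit-learn and a synthetic dataset; the "complexity" knob here is tree depth, chosen on validation, with the holdout touched only once at the end:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=3000, n_features=20, noise=10.0, random_state=0)

# Cut the data into training, validation, and test (holdout) pieces.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_depth, best_valid_error = None, np.inf
for depth in [2, 4, 6, 8, 10, None]:          # increasing model complexity
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    valid_error = mean_squared_error(y_valid, model.predict(X_valid))
    if valid_error < best_valid_error:        # validation is used to pick the best model
        best_depth, best_valid_error = depth, valid_error

final = DecisionTreeRegressor(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("holdout error:", mean_squared_error(y_test, final.predict(X_test)))  # used only once
```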

And at the same time, we can see that the training error will usually still be lower, and for very complex models it can be much lower than the test error, which is supposed to represent the real model accuracy out there. The next question is: what is the validation sample here? Where does it lie? From practical experience, I would say that, depending on the setup, depending on how well it is defined, it can be somewhere in between training and test; the better it is set up, the closer it is to test. But given certain circumstances, it can actually drift a little bit in one direction or the other. A simple example: if you have a very, very large number of models you fit and compare on the validation sample, then the validation sample is implicitly used as an optimization routine itself, and its value diminishes, drifting further away from the test sample. I'll give a couple of examples towards the end to show how validation can actually ruin things, and how validation can lead you towards a more overfitting model rather than the model that performs best on the test data. Okay, moving on to cross-validation. This is pretty much the standard out there; it is the most commonly used way to set up validation for your model, with, of course, certain exceptions depending on the size of the data, the size of the model, and other factors that might drive your decision. The very first thing to do is to define a holdout sample: something you put aside and use only to ensure that the final model you have built is robust enough, and to assess its accuracy on a dataset it has never seen before and that has never been used for either model training or model selection. The remaining part we can refer to as the full training sample, even though it's not the full sample, but we assume that the holdout is not accessible to us until the very end of our journey as a data scientist building a model. So what we do with the full training dataset is split it into K parts of equal size. In this example, we define K, the parameter of K-fold cross-validation, equal to three. That means that, generally speaking, we cut the full training dataset into three equally sized pieces, and we run the model training routine three times, every time leaving out one third of the full training set as validation and training the model only on the remaining two thirds. We repeat that three times to cover all three non-overlapping pieces of the full training set, and this way we build three models. Basically, we run the full training routine three times, and, generally speaking, we usually do it with the same set of hyperparameters: we define a set of hyperparameter values and just repeat the model training routine three times, each time using a different training subset and a different validation subset. K is equal to three in this example, but typically you would see in the literature and in practice K equal to five or ten. But it raises a couple of questions, including the question of how to choose K. There are pros and cons. Let's start with the cons. First, notice that we use only a fraction of the full training dataset: every time we train a model, we drop one third; if K is five, we drop 20% of the dataset.
And that creates a kind of difference between the dataset we'll use for final model training on the whole data versus the models we're validating, which are trained on only part of it. Another thing to keep in mind is that it takes a lot of training time. Especially if the model training is computationally intensive, that can be an issue. If K is equal to five and we want to do fivefold cross-validation on a large neural network, we will require five times more computational resources; if it took you a day to fit the model, then it will take five days to finish fivefold cross-validation. Sometimes that drives the decision towards a simpler cross-validation scheme, typically similar to having a holdout sample, where you run this cross-validation only for one of the five folds and rely on that validation alone, just to save computational time, and pay the price of not having the full cross-validation and its benefits. How to choose K? I have never seen any strict rules or strict proposals. Typically, it's either five or ten, just for the sake of not dropping too much data. A larger K means that the difference between the subset and the full set grows smaller, but a larger K also requires more computational resources to fit the models. So usually K of five is a good balance, but if the model is fast, then we can go all the way to K equal to ten, to compensate for the first potential issue of running K-fold validation. But these are the prices to pay to gain quite a lot with a properly run K-fold cross-validation. The first thing is that we're using the full training dataset for validation. That gives us, first of all, a more robust model evaluation: instead of evaluating on just a fraction of a dataset, we're actually evaluating on the full training dataset, and that's as much data as we have available. Besides that, having the full training set used for validation will enable us to use model stacking; I will cover it closer to the end of the presentation, but generally speaking, that's the ability to build a next model where the predictions of the first model are the inputs. That gives you the ability to build pipelines of models in the future, and sometimes it is worth doing, if the accuracy of the results is more important than, say, having a simple model, an interpretable model, or a fast model.

And last but not least, it is often the case that the holdout sample you define in the beginning is quite small; it's also usually a fraction of the whole dataset you have at your disposal. So having an alternative large dataset to validate the model, when the K-fold cross-validation is done properly, is a good alternative to simply looking at test-dataset-based accuracy.
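A minimal sketch of K-fold cross-validation with K = 3, assuming scikit-learn and synthetic data; it also collects the out-of-fold predictions that make full-training-set evaluation (and, later, stacking) possible:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

kf = KFold(n_splits=3, shuffle=True, random_state=0)   # K = 3 as in the example above
oof_pred = np.zeros(len(y))                            # out-of-fold predictions
fold_models = []

for train_idx, valid_idx in kf.split(X):
    model = GradientBoostingClassifier(random_state=0)  # same hyperparameters on every fold
    model.fit(X[train_idx], y[train_idx])
    oof_pred[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    fold_models.append(model)

# Every record of the full training set has been used for validation exactly once.
print("cross-validated AUC:", roc_auc_score(y, oof_pred))
```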

Now I'll talk a little bit more about what practitioners call data leaks. That's a very informal term, but a very generic one, usually used to describe data-driven issues, cases, or situations that cause the model evaluation to be biased. That can sound like not such a big deal, but it can cause lots of problems. First, if you have a data leak, you will very likely overestimate the accuracy of your model. It might not be that severe, but it might also be quite a significant overestimation, depending on the size of the leak. And if the model accuracy drives your business decisions, that can become a problem. For instance, if you're building a classification model and you need a certain accuracy for the model to make business sense, then simply overestimating it may break your use case, make the whole exercise unprofitable, and you will most likely just drop the model. But having the accuracy overestimated might not be the main issue. Sometimes, due to a data leak, the model not only reports very high accuracy but also focuses on something it's not supposed to focus on, dramatically overfits to it, and when you apply it to the day-to-day business, you get very, very poor performance. And last but not least, stacking failure. For stacking, so for building pipelines of models where the output of one model serves as the input for another, it's very important to do everything correctly and not introduce any biases, because otherwise the garbage-in, garbage-out rule will apply to the second model. If you're supplying leaky or biased predictions from model one, then model two will of course treat them wrongly, and the stack you built will not make a difference and will probably be worse than the original models. So data leak is a very informal term; you can hear it describing multiple, very different use cases. Here, I'm going to give you a couple of examples, but it's by no means an exhaustive list of what might go wrong and what can cause data leaks. First, simple ones: training records are used in validation. That happens, unfortunately, more frequently than we would like to see. Sometimes unintentionally, sometimes due to the data processing routines, you might find training records in test or validation, and that will drive your accuracy estimation quite high. It will not usually hurt model performance, but it might cause quite a lot of issues. Second, which might sound even sillier: the target variable is used as a feature. That also happens quite frequently, and you should be very cautious. Sometimes you have the target variable either explicitly or, more often, implicitly in your training data: some derivative of the target, some variable derived from the target, or something very closely correlated with the target variable that you did not intend to use as a feature might get in there. And that will cause, again, the same issues. Third, future data is used: also accidentally, you might get some data fields which are not available at the time when you plan to apply the model. When you apply the model, these features will be either not available or, even riskier, they will not be populated the way you expect them to be.
In the latter case, it will go unnoticed, because the model will not fail: it will receive these features, they will just not be populated the way you would expect. For instance, if you're building a credit scoring model and you have a feature describing the behavior of your customer, you would like this behavior to come from the data you know up to the moment of model application, and not from the future. But if you have a leak, so one way or another this future information gets into your training data sample, you might not notice it, and when you move the model to production, there is just going to be a dramatic drop in performance that you will need to investigate. All of these last examples can be aggregated under the category "the input feature is not available at inference time", which is a broader category to pay attention to. Some other examples are a little bit trickier, because they're not exactly obvious, and sometimes data scientists might not be aware of them if the content of the data is not known well. There can be records that are correlated between training and validation. Take credit scoring, for example: if you have data records from the same customer, just collected at different time points, they can still find their way into the dataset, and often it makes sense to do so, but they're quite correlated. That means that the training records are not exactly in the validation, but correlated ones are, and you might want to do something about it. Otherwise, your model will focus on memorizing features that recognize the customers rather than their behavior. Another example: chest X-rays. If you have a medical neural network which recognizes chest abnormalities, you typically have multiple X-rays from the same patient, done over a period of visits to the hospital, and if X-rays of the same patient are in both the training and test datasets, the model might actually focus on recognizing the patient from the structure of the bones rather than the abnormalities it's supposed to find. And last but not least, checking the validation metric too many times. That's how I personally put the example I described before: if you have too many models you would like to choose from, then your validation and model selection become an optimization routine of their own, and I'm going to show an example in the end of how that can actually hurt. So how to fix some of these leaks? The very first and basic example is stratified cross-validation. It's applied to improve some of the statistical properties of the cross-validation, as well as to fix some of the leaks. The first example here is stratifying by either the target variable or some of the important predictors, because if we just do random cross-validation and look at the distribution of the target classes, or the distribution of some important predictors, it might be quite unstable, especially if we have a relatively small sample, or quite a lot of classes. And that can bring some issues. One of the issues is instability of the validation scores.
For instance, if one of the classes is particularly difficult to predict, and one of the validation splits has the majority of that class while the others have a lower contribution of that class, you will have samples with quite a large difference in the error, which we would probably like to correct in some way. The easiest way is to do stratification; that's a technique that allows you to preserve, as much as possible, the statistical distribution of the chosen variable, usually one or two (it's usually impossible to stratify by a lot of variables, basically due to the curse of dimensionality). You can pick the target distribution or the distribution of a key predictor and preserve it across the splits, making the validation metrics more stable to assess. But of course, you need to keep in mind that this might actually be a risky thing, because if you apply this model to new data and in the new data the distribution is quite different, then you're going to be introducing some sort of leak here: you'll be overly trusting the distribution you assumed to be there, but it's going to change over time. The second method is stratifying by some feature that causes correlation between the records, like in the previous examples, by customer ID or patient ID. That will force the cross-validation to pull all the records from the same patient or the same customer into one validation split, and that eliminates a very tricky leak which might not be easy to investigate; it's an easy way to compensate for it. The third technique is called rolling-window cross-validation, and it's a very popular one. It is a way to make your validation, as well as your test, as close as possible to the production application in case you have time, in some form or shape, affecting the results. This is typical most of all for time series forecasting, but the same approach can be applied to typical classification and regression tasks as well, credit scoring being one: we predict some binary outcome, but we know the time when the customer came to the branch of the bank, and we know that time will have an impact on the model because the economy changes over time, and whatever we experienced in the past might not be exactly like that in the future. In order to simulate the same behavior the model will experience when moved to production, the way is to do this rolling-window cross-validation: every time, we take the part of the data from the latest period as validation and the data from the period before that as training data, so we validate on the future data and train on data from the past, then move back and repeat that a few times to have multiple validation samples to average across. Okay, now I'm going to move to a couple of practical examples. Most of them will come from Kaggle competitions, and I'm going to start with one of my favorite competitions, an NFL Big Data Bowl competition. In this one, we were expected to build a model that predicts the outcome of a specific NFL play during a game. This competition was a great example of how a good cross-validation setup can pay off. This picture depicts the correlation between the cross-validation my team set up and the test scores we were able to check during the competition. The cross-validation, as you can see from the correlation, was very, very robust.
It had some tricks we applied, including stratification, because we had plays from the same games in the sample, so we did cross-validation stratified by game. The test set provided by the NFL was built using a rolling window: the test set was coming from a different season of games, while we had the training dataset from the previous seasons.
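For reference, a minimal sketch of the three kinds of splitters discussed above (stratified by the target, grouped so that correlated records such as plays from one game or X-rays from one patient stay on one side, and rolling-window splits ordered by time), assuming scikit-learn and synthetic data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)        # target used for stratification
groups = rng.integers(0, 50, size=1000)  # e.g. game ID, customer ID, patient ID

# 1. Preserve the target (or key predictor) distribution in every fold.
strat_folds = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y))

# 2. Keep all records of the same group in a single fold, so correlated records
#    never sit on both sides of the split.
group_folds = list(GroupKFold(n_splits=5).split(X, y, groups=groups))

# 3. Rolling window: always validate on data that comes after the training period
#    (rows are assumed to be ordered by time).
time_folds = list(TimeSeriesSplit(n_splits=5).split(X))

for name, (tr, va) in [("stratified", strat_folds[0]),
                       ("grouped", group_folds[0]),
                       ("rolling", time_folds[0])]:
    print(name, "train:", len(tr), "validation:", len(va))
```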

And here you see approximately 40 models we evaluated on test during the competition; we had the ability to submit some of the models to check the test performance. But the very important point here is that we worked quite heavily, and I think we built around 400 models, so ten times the number of points you see here. And this picture confirmed to us that we could fully trust our cross-validation, meaning that for all 400 models, we were very confident in how well they would perform on the test data, which comes from a completely different set of games. As you can see, the points lie very nicely on a line which is very close to the line matching the scores exactly; actually, the test scores were a little bit better, but most importantly, they were very well correlated. So the lesson here is that a very well-built cross-validation can give you a lot of confidence in what you're doing, and even without looking at test data, which you are supposed to do only once, in the end you can be very confident in how well your model will perform in the future, which is usually the key reason to do model validation in the first place. The next example is from another Kaggle competition. It was hosted by Los Alamos National Laboratory, where they collected data for artificial earthquakes. In a nutshell, the data was collected from a laboratory device where they put stress on particles and measure the acoustic signals, so they have a bunch of microphones on the device, and with increasing stress, they record the sound of some of the particles breaking, this way simulating the upcoming earthquake. By the time the stress is too much, the whole layer of particles breaks, causing the laboratory earthquake. For this competition, we were given a sequence of these audio recordings over time, and we were expected to predict the expected time until the earthquake happens, or in laboratory terms, until this layer breaks. As you can see on the right-hand side, the lower graph is the acoustic signal and the top graph is basically the time until the earthquake. That's how the training data looked. And surprisingly, this competition turned out to be quite not what we expected: building a very large model or generating a lot of smart features out of the acoustic signal didn't really work. What worked was a tricky, or smart, validation scheme. So I would say that at least half of the success was setting up the validation in a smart way.

The thing was that we didn't have that much data. We had just a handful of laboratory earthquakes, I think we see all of them over here, a little fewer than 20 earthquakes, and we had just a handful of earthquakes in the test data. But we had the acoustic data of the test set available for us to make predictions. So the trick we applied was to generate only a handful of important features from the acoustic signals: instead of generating hundreds, we generated only four. And we used the Kolmogorov-Smirnov statistic as a metric to assess how far the training subsamples are from the test subsamples. What we did was actually create validation data which looks as close to the test data as possible, by means of throwing out anything which doesn't look like it. So the validation scheme that worked really well was subsampling the original training data and validation data to make it look as close as possible to the test data, so the model was built and validated on data which was closer to the data it was assessed on. It is a very Kaggle-looking trick, but it can be used to some extent for business applications too, because sometimes you have the ability to look at your future test data, or at least some samples of that data, especially for cases when it takes quite a lot of time for you to learn the true target. Again, a simple example is credit scoring, where it usually takes 12 months to get the true target variable for a single customer, but you know the customer's features today. So if there have been some distributional changes in your customer base, you can already adjust your validation or test dataset, or maybe even the training dataset, to compensate for that and to focus more on your current customers when building the model. A third example is actually also coming from The Elements of Statistical Learning. This is what is called in the book the wrong way to do cross-validation, but it's something to keep in mind, and I'll have a more practical example in the same area. In this example, we consider a classification problem with a large number of predictors and the following strategy: we screen the predictors, we find a subset of good predictors that show good correlation with the target, and after we do that, we throw away the remaining ones, build the cross-validation, tune the parameters, and build the model. There is a very crucial flaw in this strategy, and in the book they describe a simulation of exactly this strategy, where the target variable and the predictors are generated artificially: basically, we're trying to predict a variable using noise as predictors. And they show that if you apply such a strategy, instead of getting 50% error, as you would expect with pure-noise predictors, you can report, I think they got, around 3% error for such a model with quite a large number of predictors, which is, of course, a very unrealistic estimation. The right way to do cross-validation should give you the right result, meaning an error around 50%, and the right way would be to screen the predictors and find the set of good predictors as a part of your cross-validation steps. So if you do it before you do cross-validation, or before you define your holdout sample, then you will very likely get the wrong results, especially if you have a large number of predictors. And one more example comes from Kaggle as well.
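Before moving to it, here is a minimal sketch of the simulation just described (pure-noise predictors, with feature screening done once on the full data versus inside each cross-validation fold), assuming scikit-learn; the exact numbers will vary with the random seed:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))            # pure noise predictors
y = rng.integers(0, 2, size=50)            # random binary target

# Wrong: screen predictors on the FULL data, then cross-validate on the survivors.
best = SelectKBest(f_classif, k=20).fit_transform(X, y)
wrong = cross_val_score(KNeighborsClassifier(), best, y, cv=5).mean()

# Right: screening is a step inside the cross-validation loop.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("model", KNeighborsClassifier())])
right = cross_val_score(pipe, X, y, cv=5).mean()

print(f"wrong-way accuracy ~ {wrong:.2f}, right-way accuracy ~ {right:.2f}")
# The wrong way reports far better than chance; the right way stays around 50%.
```

The wrong-way score looks impressive only because the screening step has already seen the validation records' labels before the split was made.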
That competition took place around three years ago, hosted by Santander, where the bank wanted the competitors to predict customer transactions. Unfortunately, the data was quite heavily anonymized, so we still don't know what was behind the target and the predictors, but what we were given was a binary classification task with 200 anonymized features that were very much uncorrelated, and obviously that was done artificially. Quite early in the competition, it became clear that a successful approach was not to build a single model that uses all 200 features, but rather to build 200 individual models using a single feature each. Why? Because most of the techniques, like GBMs and neural networks, focus a lot on finding interactions between the features, especially tree-based ones, and when the features have zero correlation, the model spends a lot of effort finding correlations which don't exist. So building individual models, one per feature, 200 times, and using the product of the predicted probabilities actually gave quite a bit better results. But that was not it. We played a lot with different validation techniques for this particular use case, and two options of how to run it gave quite an interesting result. Option one, the one we applied in the end: we optimized a single set of parameters which we applied to all 200 models, so the same set of values was used 200 times to build 200 models, and we optimized this one vector of parameters for all models at once. That gave us 92.5 on cross-validation and 92.2 on test. So apparently there was a sign of a small overfit, of 0.3%, but it was well correlated with test, so we proceeded further. However, we ran another experiment where we did something more feature-specific, or more individual-model-specific, something that sounds like a simple and smart thing to do: instead of freezing the vector of hyperparameters, we simply added early stopping for each of the 200 models. So the only difference from option one was not having a fixed number of trees in the GBMs, but having it be single-model-specific and based on the validation: for each of the 200 models, we ran training and validation, and we optimized the number of trees based on the validation. And the result was the opposite of what one might have hoped for. The cross-validation score was much higher, 92.7%, and in terms of the competition metric that was quite a large boost, but the test score dropped to 92.1, meaning that the difference between cross-validation and test grew by a factor of two. So by adding only early stopping as an additional degree of freedom, we actually increased overfitting: we overestimated the model and made it perform worse on hidden data.
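A minimal sketch of the single-feature-model idea, assuming scikit-learn and a generic synthetic binary dataset (not the competition data): one small GBM per feature, one frozen hyperparameter vector shared by all of them, and the predicted probabilities combined by a simple product, as in option one above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# One frozen hyperparameter vector applied to every single-feature model.
# (Tuning the tree count per model with early stopping is the extra degree of
#  freedom that increased overfitting in the second experiment.)
params = dict(n_estimators=50, max_depth=2, learning_rate=0.1, random_state=0)

combined = np.ones(len(y_va))
for j in range(X.shape[1]):
    model = GradientBoostingClassifier(**params)
    model.fit(X_tr[:, [j]], y_tr)                         # a model per single feature
    combined *= model.predict_proba(X_va[:, [j]])[:, 1]   # product of probabilities

print("AUC of the product of single-feature models:", roc_auc_score(y_va, combined))
```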

Okay, in the last few minutes, I would like to talk a little bit more about ensembling and the importance of validation for ensembling, just a couple of things to keep in mind, things we use a lot when we work with data, especially when working with complex models. The first thing I want to mention is the full fit versus a bag of cross-validation fits. By default, usually, after we optimize the hyperparameters and find the best vector, we just refit the model on the full training dataset. That works nicely if it is a single model which is fast; if it's an interpretable model, it allows you to interpret the outputs. But sometimes an alternative is to take the three, or K, generally speaking, models that you built during the cross-validation, and instead of refitting one more on the full training dataset, you just define your final model as a bag of the K models you've built. So your final prediction will be just an average of the predictions of the individual cross-validation-fitted models. That allows you not to run a full fit, and it can actually give you maybe even more robust results: usually it performs equally to a single model, but sometimes, I believe, it performs a little bit better than training a single model. Moreover, it might give you a little bit more trust in the final model, because for each of the models in this bag you know your exact validation score, and therefore you can trust that there were no issues during training. For some technical reasons, some routine might break during the training of the full fit; here, you know that everything worked fine and the validation was okay for these models. The second thing is stacking. It's a technique which is very popular for squeezing the most accuracy out of tabular models, but in order for it to work, you need to do careful cross-validation. If you do it carefully, then you can generate a new dataset where, instead of the original features, you use the out-of-fold predictions of the models you have built, say multiple GBMs and multiple neural networks. Then you get a new dataset of the same size with the same target but new features, and you can build another model on top of that. That builds you a sort of pipeline where the input of the second layer of models is the outputs of the first layer of models. And one more thing to keep in mind: nested cross-validation is something you might consider doing if you find yourself struggling with a leak, or if you want to build a very robust pipeline and a large stack of models. In a nutshell, it's nothing but running cross-validation within cross-validation. So if you do K-fold cross-validation, that would mean you build on the order of K-squared models, but that can compensate for introducing a model selection bias. Okay, this is it. Thanks a lot.
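To make those closing points concrete, here is a minimal sketch, assuming scikit-learn and synthetic data, of (1) keeping the K fold models as a "bag" instead of one full refit and (2) stacking a second-level model on the out-of-fold predictions:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

first_level = {
    "gbm": GradientBoostingClassifier(random_state=0),
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
}

bags = {name: [] for name in first_level}    # bag of cross-validation fits
oof = np.zeros((len(y), len(first_level)))   # out-of-fold predictions = new features

for train_idx, valid_idx in kf.split(X):
    for col, (name, prototype) in enumerate(first_level.items()):
        model = clone(prototype)
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx, col] = model.predict_proba(X[valid_idx])[:, 1]
        bags[name].append(model)

# (1) Bag of cross-validation fits: the final first-level prediction is simply
#     the average over the K fold models -- no refit on the full training data.
def bag_predict(models, X_new):
    return np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)

# (2) Stacking: a second-level model trained on the out-of-fold predictions.
stacker = LogisticRegression()
print("stacked CV accuracy:", cross_val_score(stacker, oof, y, cv=kf).mean())
```

For a fully unbiased estimate of the stacked model, the nested cross-validation mentioned above would wrap this whole loop in an outer K-fold, at the cost of roughly K-squared model fits, in exchange for protection against model selection bias.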