Target leakage can happen when information that should not be available as a feature — because it only becomes known after the target outcome — is used to predict the target.
Target leakage happens more frequently with complex datasets and preprocessing pipelines. For instance, suppose the training dataset was normalized or standardized, or had missing values imputed, using statistics (min, max, mean) computed over the entire dataset. Unseen data carries no knowledge of that distribution — a single incoming row does not know whether it is accompanied by one other record or one million — so those statistics cannot be reproduced from unseen data alone. If they were computed before the data was split, information about the held-out data leaks into training. The result is a model that overfits the training data and reports higher accuracy during evaluation than it can actually achieve on genuinely unseen data.
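To make the preprocessing pitfall concrete, here is a minimal sketch (with made-up numbers) contrasting the correct approach — fitting standardization statistics on the training split only — with the leaky approach of fitting them on all the data at once:

```python
# Minimal sketch of preprocessing leakage (hypothetical data).
# Standardization statistics must come from the training split only;
# computing them over train + test lets test information leak into training.

def mean_std(values):
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5

train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]  # unseen data with a very different distribution

# Correct: fit the statistics on the training data only...
mu, sigma = mean_std(train)
train_scaled = [(v - mu) / sigma for v in train]
# ...then reuse those same training statistics on unseen data.
test_scaled = [(v - mu) / sigma for v in test]

# Leaky: fitting on train + test shifts the statistics toward the test
# distribution, so the "training" features already encode unseen data.
mu_leak, sigma_leak = mean_std(train + test)
```

The same reasoning applies to min/max scaling and mean imputation: compute the statistic on the training split, then reuse it unchanged everywhere else.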
Also think about target leakage occurring over time, as new data becomes available and changes values already used in the model.
Consider the following example. You want to predict who will get sick with a sinus infection. The top few rows of your raw data look like this:
Most people tend to take antibiotics after getting a sinus infection so they can recover. This raw data shows a strong relationship between got_sinus_infection and took_antibiotic. However, took_antibiotic is commonly updated after the value for got_sinus_infection has been determined, which makes it a source of target leakage.
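The fix is to drop the leaky column before training. A minimal sketch, using the column names from the example above (the rows themselves are made up for illustration):

```python
# Hypothetical rows mirroring the sinus-infection example.
rows = [
    {"age": 35, "took_antibiotic": True,  "got_sinus_infection": True},
    {"age": 42, "took_antibiotic": False, "got_sinus_infection": False},
]

TARGET = "got_sinus_infection"
# took_antibiotic is set *after* the target is known, so it must be excluded.
LEAKY = {"took_antibiotic"}

def split_features(row):
    """Separate a row into (features, target), dropping leaky columns."""
    y = row[TARGET]
    X = {k: v for k, v in row.items() if k != TARGET and k not in LEAKY}
    return X, y

X, y = split_features(rows[0])
```

Any feature whose value can change after the target is determined should be treated the same way.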
Target leakage is a persistent problem in data science and machine learning. It can cause a model to drastically underestimate its generalization error, making the model look accurate in evaluation but useless for any real-world application.
Target leakage is particularly challenging because it can be either intentional or unintentional, making it difficult to pinpoint. Kaggle contestants have been known to deliberately exploit sampling errors that introduce target leakage as a way to boost apparent model accuracy and gain a competitive advantage in data science competitions.
Try to avoid target leakage by omitting data that would not be available at the time of prediction — in particular, values recorded or updated after the target outcome is determined.
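One practical way to apply this rule is to compare each feature's recording timestamp against the moment the prediction would actually be made. A minimal sketch, with hypothetical field names and dates:

```python
from datetime import datetime

# Hypothetical log of when each feature's value was recorded.
feature_log = {
    "age":             datetime(2023, 1, 1),
    "visit_reason":    datetime(2023, 1, 1),
    "took_antibiotic": datetime(2023, 1, 9),  # updated after diagnosis
}

# The moment the model would be asked to make its prediction.
prediction_time = datetime(2023, 1, 5)

# Keep only features whose values existed at or before prediction time.
safe_features = [name for name, recorded_at in feature_log.items()
                 if recorded_at <= prediction_time]
```

Features recorded after `prediction_time` (here, `took_antibiotic`) are excluded automatically, which is exactly the filtering the rule above describes.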
As stated above, leaks in predictive modeling can cause predictive models to appear more accurate than they are — anywhere from overly optimistic to completely invalid. The cause tends to be highly correlated data, where the training data contains the very information you are trying to predict.
Label leakage happens when information that is not available at the time of prediction leaks into the training set. This results in models that appear perfect on paper but are useless in practice.
H2O.ai and Big Data: H2O AI is a platform that helps data scientists apply Big Data to their datasets much faster. H2O allows data scientists to get past the technology layer that changes daily and get straight to making, operating, and innovating with AI. As a result, businesses can innovate faster using proven AI technology. H2O.ai enables teams of data scientists, developers, machine-learning engineers, DevOps, IT professionals, and business users to work together with the same toolset toward a common goal.
Driverless AI runs a model to determine the predictive power of each feature on the target variable. Then, a simple model is built on each feature with significant variable importance. The models with high AUC (for classification) or R² score (for regression) are reported to the user as potential leaks.
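The idea behind this check can be sketched in a few lines: score each feature on its own against the target and flag any feature whose single-feature AUC is suspiciously high. This is only a toy illustration of the concept — the feature values, the 0.95 threshold, and the trivial "model" (using the raw feature value as the score) are all assumptions, not Driverless AI's actual implementation:

```python
def auc(scores, labels):
    # Probability that a random positive outranks a random negative,
    # with ties counting half -- the Mann-Whitney formulation of AUC.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical binary target and two candidate features.
y = [1, 1, 0, 0, 1, 0]
features = {
    "leaky_feature":  [0.9, 0.8, 0.1, 0.2, 0.95, 0.05],  # tracks the target
    "normal_feature": [0.4, 0.9, 0.6, 0.3, 0.2, 0.8],
}

# Flag any feature whose single-feature AUC exceeds the threshold.
suspects = {name: auc(vals, y) for name, vals in features.items()
            if auc(vals, y) > 0.95}
```

Here `leaky_feature` separates the classes perfectly on its own (AUC of 1.0), which is exactly the signature a leak check looks for.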
To find original features that are causing leakage, have a look at features_orig.txt in the experiment summary download. Features causing leakage will have high importance there. To get a hint at derived features that might be causing leakage, create a new experiment with dials set to 2/2/8, and run the new experiment on your data with all your features and response. Then analyze the top 1-2 features in the model variable importance. They are likely the main contributors to data leakage if it is occurring.
In Driverless AI, we take special measures at various stages of the modeling process to ensure that overfitting is prevented and that all models produced are generalizable and give consistent results on test data.
Because Driverless AI excels at building highly predictive models, it has built-in measures to control overfitting and underfitting.
Driverless AI will throw warning messages if some features are strongly correlated with the target but typically does not take any additional action by itself unless the correlation is perfect.