
H2O.ai WIKI

Underfitting

What is Underfitting?

Underfitting is a term used to describe a model that fails to capture the relationship between its input and output variables. An underfit model produces a high error rate not only on the training set but also on unseen data. Underfitting most often occurs when there is insufficient data, or the wrong kind of data, for the task at hand.
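As a minimal sketch of the idea (synthetic data and plain NumPy, not any particular H2O API), fitting a straight line to clearly quadratic data leaves a large error even on the data the model was trained on:

```python
import numpy as np

# Hypothetical data: a quadratic relationship a linear model cannot capture.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = x**2 + rng.normal(0, 0.1, size=x.size)

# Degree-1 (linear) fit underfits: it misses the curvature entirely.
linear_coeffs = np.polyfit(x, y, deg=1)
linear_pred = np.polyval(linear_coeffs, x)

# Degree-2 fit matches the true pattern.
quad_coeffs = np.polyfit(x, y, deg=2)
quad_pred = np.polyval(quad_coeffs, x)

mse_linear = np.mean((y - linear_pred) ** 2)  # large: underfit
mse_quad = np.mean((y - quad_pred) ** 2)      # small: adequate capacity
```

The point of the sketch is that the underfit model's error is high on its own training data, which is the defining symptom of underfitting.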

 

Why is Underfitting important?

An underfit model may fail to take relevant aspects of the problem into account, producing oversimplified and unrealistic results. Decisions should not be based on underfit models, because the model's suggestions do not accurately reflect the data. For an organization to keep total costs down, the model must be well matched to the data.

 

Reasons for Underfitting

There are several reasons why underfitting occurs, and its effects are detrimental to both the model and its output. A few of the reasons are:

The dataset is too small

Building an accurate model requires substantial data. When there is insufficient data, the model cannot learn the underlying pattern and will make inaccurate predictions.
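To illustrate (a synthetic NumPy sketch; the sample sizes are hypothetical), the same model fit on very few points generalizes far worse than when it is fit on many:

```python
import numpy as np

rng = np.random.default_rng(5)

def heldout_error(n_train):
    # Fit a cubic to n_train noisy samples, then score it on a dense grid
    # against the noise-free ground truth.
    x_tr = rng.uniform(-2, 2, n_train)
    y_tr = x_tr**3 - x_tr + rng.normal(0, 0.3, n_train)
    coeffs = np.polyfit(x_tr, y_tr, deg=3)
    x_te = np.linspace(-2, 2, 200)
    y_true = x_te**3 - x_te
    return np.mean((np.polyval(coeffs, x_te) - y_true) ** 2)

# Average over repeated draws so the comparison is not a fluke.
err_small = np.mean([heldout_error(5) for _ in range(100)])
err_large = np.mean([heldout_error(200) for _ in range(100)])
```

With only five points the model has almost nothing to anchor its estimate, so its error on new data is far higher than with two hundred points.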

 

There is a low variance and a high bias

An underfit model makes strong simplifying assumptions about the data, giving it a high bias: its predictions are systematically wrong in the same way no matter which training sample it sees. Because those predictions change little from one training set to the next, the model also has a low variance. High bias combined with low variance is the signature of underfitting.
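This can be made concrete by fitting a deliberately simple model, here just a constant, to many resampled versions of the same noisy data (a synthetic sketch): its average prediction is far from the truth (high bias), yet it barely changes between samples (low variance).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
true_y = np.sin(2 * np.pi * x)  # the pattern the model should learn

# Fit a degree-0 "model" (a single constant) on many resampled datasets.
preds = []
for _ in range(200):
    y = true_y + rng.normal(0, 0.2, x.size)
    c = np.mean(y)                      # the constant model's one parameter
    preds.append(np.full_like(x, c))
preds = np.array(preds)                 # shape (200, 50)

# Variance: how much predictions differ across training sets (small).
variance = np.mean(np.var(preds, axis=0))
# Squared bias: how far the average prediction is from the truth (large).
bias_sq = np.mean((np.mean(preds, axis=0) - true_y) ** 2)
```

The constant model is stable but systematically wrong, which is exactly the low-variance, high-bias profile described above.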

The training data is not compatible with the model

The user must ensure that the data they are inputting is compatible with the model they are using. A linear model, for example, cannot capture a non-linear relationship between inputs and outputs.

The model used is too simple for the dataset

When the model is too simple relative to the complexity of the dataset, it cannot represent the underlying pattern, and its predictions will be inaccurate and unreliable.

Noise is included in the dataset used for training

Noise, meaning any random distortion in the data, can obscure the true signal; a model trained on very noisy data may fail to learn the underlying pattern.

 

How to Avoid Underfitting

Decrease the amount of regularization used

Regularization penalizes model complexity in order to reduce variance, and in most cases it is useful. Problems occur, however, when the regularization is so strong that the model is no longer flexible enough to fit the genuine patterns in the data. An over-regularized machine learning (ML) model produces oversimplified results.
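As a sketch of the effect (closed-form ridge regression on synthetic data; the penalty strength `alpha` values are hypothetical choices), a very large penalty shrinks the weights toward zero and the model underfits:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, 100)

def ridge(X, y, alpha):
    # Closed-form ridge solution: w = (X^T X + alpha I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

w_weak = ridge(X, y, alpha=0.1)    # mild penalty: close to the true weights
w_strong = ridge(X, y, alpha=1e4)  # heavy penalty: weights shrunk to ~0

mse_weak = np.mean((y - X @ w_weak) ** 2)
mse_strong = np.mean((y - X @ w_strong) ** 2)  # much larger: underfit
```

Easing the penalty restores the model's ability to fit the real signal, which is what "decrease the amount of regularization" means in practice.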

Increase the amount of training the model receives

Users must ensure that the model is sufficiently trained without being overtrained, because there is a delicate balance between underfitting and overfitting. Training the model on the proper amount of data for the proper amount of time yields accurate results.

Select the correct features

A model needs enough informative, predictive features to capture the relationship in the data. Without enough predictive features, it will give inaccurate results.

Create a more complex model

A model must be expressive enough to capture the patterns in the data. Users experiencing underfitting may need to increase the model's capacity, for example by adding features or switching to a more flexible model class, and may also need more varied data: if the data is limited or too uniform, even a capable model cannot learn from it. With greater capacity and richer data, the model can better represent the dataset.
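A small synthetic sketch of adding capacity: the same data fit with a degree-1 polynomial (too simple) versus a degree-3 polynomial (matching the true cubic pattern, which is an assumption of this example):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-2, 2, 80)
y = 1 - 2 * x + 0.5 * x**3 + rng.normal(0, 0.1, x.size)  # cubic ground truth

errors = {}
for deg in (1, 3):
    coeffs = np.polyfit(x, y, deg=deg)
    errors[deg] = np.mean((y - np.polyval(coeffs, x)) ** 2)
# errors[1] is much larger than errors[3]: the linear model underfits.
```

The degree-3 model has just enough extra capacity to capture the curvature the linear model misses, without being needlessly complex.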

 

Underfitting vs. Overfitting

Underfitting occurs when a learning model oversimplifies the relationships in the data. Underfit models show low variance and high bias.

Overfitting, the opposite of underfitting, occurs when a model is overtrained or too complex for the data. An overfit model becomes overly accustomed to the training data, memorizing its noise and quirks along with its patterns. Overfit models show high variance and low bias.

Users know their model is overfit when it performs well on training data but poorly on evaluation data; likewise, they know it is underfit when it performs poorly even on training data. It is essential to find the balance between the two: the model needs enough capacity and information to learn the pattern without simply memorizing the training set.
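This diagnostic can be sketched with a simple held-out split (synthetic data; the split rule and polynomial degrees are hypothetical choices). The underfit model's error is high on both splits, while a model with adequate capacity is low on both:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-3, 3, 90)
y = np.sin(x) + rng.normal(0, 0.1, x.size)

# Hypothetical split: every third point is held out for evaluation.
test_mask = np.zeros(x.size, dtype=bool)
test_mask[::3] = True
x_tr, y_tr = x[~test_mask], y[~test_mask]
x_te, y_te = x[test_mask], y[test_mask]

def errors(deg):
    coeffs = np.polyfit(x_tr, y_tr, deg=deg)
    tr = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)
    te = np.mean((y_te - np.polyval(coeffs, x_te)) ** 2)
    return tr, te

tr1, te1 = errors(1)  # underfit: high error on BOTH splits
tr7, te7 = errors(7)  # adequate capacity: low error on both
```

Comparing training and evaluation error side by side is what distinguishes the two failure modes: underfitting shows up on both numbers, overfitting only in the gap between them.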

 

