Return to page


What Are Training Sets?

Training sets, or training datasets, are example data used in the training process of machine learning (ML) models. They are given to ML algorithms to learn to make predictions and find insights within the dataset. There are two ways to train a model listed below.

Supervised learning

Supervised learning requires training sets to be labeled in order for the machine to learn how the input and output variables are connected. Labeled data means the target (outcome) the model should predict is input by the user. 

Unsupervised learning

Unsupervised learning does not require the training sets to be labeled in order for models to find patterns and make predictions. Using the unlabeled data, models can learn from past predictions. 


Training Set vs. Testing Set

As previously stated, training sets are used to train ML models to make predictions. Testing sets are used to test a trained model. Testing sets should never be used to train a model, as it can lead to inaccuracy. 

Training sets should be a larger portion of the data than the testing set. In order for the model to accurately learn from these sets, it is crucial to provide a significant amount of data.The ratio is heavily debated, but most experts suggest using an 80:20 ratio of training data to testing data. 


Characteristics of Quality Training Sets

Quality data in training sets is crucial for a model to make accurate predictions. Users must ensure that personal bias does not impact the data in a training set. The performance of an ML model depends on accurate and sufficient training sets.


In order to produce accurate predictions, the data in a training set must be relevant to the task the model is performing. ML models can only produce accurate predictions if the training sets are relevant. This could be compared to a model being trained to analyze vehicle registration but only being given the weather in training. 


The data in a training set must be representative of each attribute the model must predict. An example would be a user training a model to recognize hair color in images but only training the model with pictures of brunettes. 


The data in a training set must share an attribute. An example would be a user training a model to recognize names and street addresses but using data that occasionally includes a gender. 


The training set must be complex and large enough to appropriately train the model. 


How are training sets used in machine learning?

ML models use previously learned information to make future predictions. The models are consistently learning and making new predictions. The quality and quantity of training sets will set the tone of ML accuracy. 


H2O Resources on Training Sets

Unsupervised Machine Learning

Supervised Machine Learning