Gradient Descent

What is gradient descent?

Gradient descent is an iterative optimization algorithm used to find a local minimum of a differentiable function, usually with the goal of minimizing prediction error. It is often used when the best parameter values can’t be calculated directly and must instead be found through repeated small adjustments.

Important terms related to gradient descent

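Coefficient - A parameter value of the function being fitted; it is reevaluated on each iteration until the cost is as close to 0 as possible (or good enough).

Cost - The function being minimized, which measures how far the model’s predictions are from the actual values; gradient descent is used to find its minimum.

Delta - The derivative (gradient) of the cost function with respect to the coefficients; its sign and size indicate which way, and how steeply, the cost is sloping at the current point.

Random initialization - The process of choosing random starting values for the coefficients before the first iteration of gradient descent.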

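Local minima - Points where the cost is lower than at all nearby points. In practice, gradient descent stops when the derivative of the function is as close to 0 as acceptable.

True local minima - Local minima at which the derivative of the function is exactly 0. A derivative of 0 alone is not sufficient, since it also occurs at maxima and saddle points.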

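Learning rate - A hyperparameter that controls the size of each update step, and therefore how quickly a model adapts to a given problem. Too large a value can overshoot the minimum; too small a value makes convergence slow.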

Iterations (batch) - The number of times a gradient descent algorithm’s parameters are updated.
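
The terms above can be tied together in a minimal Python sketch. The cost function, its minimum at 3, the learning rate of 0.1, the tolerance, and all function names here are illustrative assumptions, not part of any particular library.

```python
import random

def cost(w):
    # Hypothetical cost function with its minimum at w = 3.
    return (w - 3.0) ** 2

def delta(w):
    # Derivative of the cost function above.
    return 2.0 * (w - 3.0)

def gradient_descent(learning_rate=0.1, iterations=200, tolerance=1e-8, seed=0):
    rng = random.Random(seed)
    w = rng.uniform(-10.0, 10.0)    # random initialization of the coefficient
    for _ in range(iterations):
        d = delta(w)
        if abs(d) < tolerance:      # derivative is as close to 0 as acceptable
            break
        w -= learning_rate * d      # step against the slope
    return w, cost(w)

w, final_cost = gradient_descent()
```

With these assumed settings the coefficient ends up very close to 3 and the cost very close to 0. A much larger learning rate (for example 1.5 on this cost) would make each step overshoot and diverge.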


Why is gradient descent important?

Gradient descent uses the gradient of the error to adjust the parameters of machine learning models. This minimizes their cost function while keeping computation time low, so that models and predictions can be delivered quickly. While there are other optimization algorithms with better convergence guarantees, few are as computationally efficient as gradient descent.

This efficiency enables gradient descent algorithms to train neural networks on large datasets with reasonable turnaround. Gradient descent is a simple, effective tool that proves useful for straightforward, quantitative neural network training. 


Gradient Descent vs. Other Technologies & Methodologies

Gradient descent vs stochastic gradient descent

Compared to gradient descent over the full dataset, stochastic gradient descent is much faster per iteration and more suitable for large datasets. The gradient is calculated not for the entire dataset but for one randomly chosen data point per iteration, so each update is cheaper, but the variance of the updates is higher.
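
As a rough illustration of the difference, the sketch below fits a single weight to a small made-up dataset following y = 2x, once using the gradient over all points and once using one random point per iteration. The data, learning rate, and function names are assumptions chosen for the example.

```python
import random

# Hypothetical noise-free data following y = 2x.
xs = [0.1 * i for i in range(1, 21)]
ys = [2.0 * x for x in xs]

def full_gradient(w):
    # Gradient of the mean squared error over the entire dataset.
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def point_gradient(w, i):
    # Gradient of the squared error for a single data point.
    return 2.0 * (w * xs[i] - ys[i]) * xs[i]

def gradient_descent(w=0.0, lr=0.1, steps=200):
    for _ in range(steps):
        w -= lr * full_gradient(w)        # one pass over all points per update
    return w

def stochastic_gradient_descent(w=0.0, lr=0.1, steps=200, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        i = rng.randrange(len(xs))        # one random point per update
        w -= lr * point_gradient(w, i)
    return w
```

Both functions should approach a weight of 2 on this data. The single-point gradients average out to the full gradient, but individually they vary from point to point, which is the higher variance described above.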

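Gradient descent vs Newton method

Like gradient descent, the Newton method can be used to find local minima. Where it differs from gradient descent, however, is in its approach. The Newton method is a root-finding algorithm: for optimization, it is applied to the derivative of the function and looks for points where that derivative is 0. To do so it uses the second derivative as well as the first, which typically means fewer but more expensive iterations than gradient descent.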
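
A small sketch can make the contrast concrete. It assumes a hypothetical one-dimensional function f(w) = w**2 + exp(w); gradient descent uses only the first derivative, while the Newton update divides the first derivative by the second, which amounts to finding a root of the first derivative. The function, tolerances, and names are illustrative choices.

```python
import math

def d1(w):
    # First derivative of the hypothetical f(w) = w**2 + exp(w).
    return 2.0 * w + math.exp(w)

def d2(w):
    # Second derivative of the same function.
    return 2.0 + math.exp(w)

def gradient_descent_steps(w=0.0, lr=0.1, tol=1e-10, max_steps=10_000):
    steps = 0
    while abs(d1(w)) > tol and steps < max_steps:
        w -= lr * d1(w)             # fixed-size step scaled by the slope
        steps += 1
    return w, steps

def newton_steps(w=0.0, tol=1e-10, max_steps=100):
    steps = 0
    while abs(d1(w)) > tol and steps < max_steps:
        w -= d1(w) / d2(w)          # root-finding step applied to d1
        steps += 1
    return w, steps

w_gd, n_gd = gradient_descent_steps()
w_newton, n_newton = newton_steps()
```

On this function both methods arrive at the same point, and the Newton variant needs noticeably fewer iterations, at the price of computing a second derivative on every step.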

Gradient descent vs backpropagation

Gradient descent is an optimization algorithm for minimizing the error of a predictive model on a training dataset. Backpropagation is an automatic differentiation algorithm for calculating gradients for the weights in a neural network graph structure. These two algorithms are used in tandem to effectively train neural network models.
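
The division of labor can be sketched with a deliberately tiny, hypothetical network: one tanh hidden unit and one output weight, trained on a single made-up example. The `backward` function plays the role of backpropagation (computing gradients with the chain rule), and the loop in `train` plays the role of gradient descent (applying them). All names and values are assumptions for illustration.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)            # hidden activation
    return h, w2 * h                 # (hidden value, prediction)

def backward(w1, w2, x, y):
    # Backpropagation: chain rule from the squared-error loss to each weight.
    h, y_hat = forward(w1, w2, x)
    d_yhat = 2.0 * (y_hat - y)       # dLoss/dy_hat
    d_w2 = d_yhat * h                # dLoss/dw2
    d_h = d_yhat * w2                # dLoss/dh
    d_w1 = d_h * (1.0 - h * h) * x   # dLoss/dw1, using tanh' = 1 - tanh^2
    return d_w1, d_w2

def train(x=1.0, y=0.5, w1=0.5, w2=0.5, lr=0.1, steps=500):
    for _ in range(steps):
        d_w1, d_w2 = backward(w1, w2, x, y)   # backpropagation computes gradients
        w1 -= lr * d_w1                       # gradient descent applies them
        w2 -= lr * d_w2
    return w1, w2
```

Swapping the update lines for a different optimizer would leave `backward` unchanged, which is why the two are usually described as separate algorithms.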

Gradient descent vs gradient boosting

Gradient boosting complements gradient descent. Specifically, boosting algorithms are iterative functional gradient descent algorithms: they optimize a cost function over function space by iteratively choosing weak hypotheses that point in the negative gradient direction.
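
A minimal sketch of this idea, assuming squared-error cost and one-split regression stumps as the weak hypotheses: for squared error, the negative gradient with respect to each prediction is proportional to the residual, so each round fits a stump to the current residuals. The helper names, the shrinkage value of 0.3, and the single-feature data layout are illustrative assumptions, not a library API.

```python
def fit_stump(xs, targets):
    # Weak hypothesis: a one-split regression stump on a single feature.
    # Assumes xs contains at least two distinct values.
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def gradient_boost(xs, ys, rounds=50, learning_rate=0.3):
    base = sum(ys) / len(ys)                 # start from a constant model
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        # Negative gradient of squared error is proportional to the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)     # weak hypothesis for this round
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

For example, fitting `gradient_boost` to `xs = list(range(10))` and `ys = [x * x for x in xs]` should give a lower training error than the constant mean model it starts from, since every round moves the predictions a small step toward the residuals.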

Gradient descent vs gradient ascent

Gradient descent is used primarily to minimize a function, such as prediction error, by following its downward slope, while gradient ascent is used to maximize a function by following its upward slope. Gradient ascent on a function is equivalent to gradient descent on the negation of that function.
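
The only change between the two is the sign of the update, as the following sketch suggests. It assumes a hypothetical objective f(w) = -(w - 2)**2 + 5, whose maximum is at w = 2; the function names and settings are illustrative.

```python
def gradient(w):
    # Derivative of the hypothetical objective f(w) = -(w - 2)**2 + 5.
    return -2.0 * (w - 2.0)

def gradient_ascent(w=0.0, lr=0.1, steps=200):
    for _ in range(steps):
        w += lr * gradient(w)        # step along the gradient to climb
    return w

def gradient_descent_on_negated(w=0.0, lr=0.1, steps=200):
    for _ in range(steps):
        w -= lr * (-gradient(w))     # descend on -f: the same update
    return w
```

Both functions should end up near w = 2, the point where the assumed objective is largest.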

Gradient descent vs batch gradient descent

Machine learning algorithms differ in their representations and coefficients, but they all require an optimization process to find the coefficients that give the best estimate of the end result. Batch gradient descent is the variant that calculates the error for every example in the training dataset and updates the coefficients only after all examples have been evaluated, so each iteration treats the entire dataset as one batch. Batch gradient descent is the most commonly described form of gradient descent in machine learning.