- Activation Function
- Confusion Matrix
- Convolutional Neural Networks
- Forward Propagation
- Generative Adversarial Network
- Gradient Descent
- Linear Regression
- Logistic Regression
- Machine Learning Algorithms
- Multilayer Perceptron
- Naive Bayes
- Neural Networking and Deep Learning
- RuleFit
- Stack Ensemble
- Word2Vec
- XGBoost

- Attention Mechanism
- BERT
- Binary Classification
- Classify Token ([CLS])
- Conversational Response Generation
- GLUE (General Language Understanding Evaluation)
- GPT (Generative Pre-Trained Transformers)
- Language Modeling
- Layer Normalization
- Mask Token ([MASK])
- Probability Distribution
- Probing Classifiers
- SQuAD (Stanford Question Answering Dataset)
- Self-attention
- Separate token ([SEP])
- Sequence-to-sequence Language Generation
- Sequential Text Spans
- Text Classification
- Text Generation
- Transformer Architecture
- WordPiece

- AUC-ROC
- Analytical Review
- Autoencoders
- Bias-Variance Tradeoff
- Decision Optimization
- Explanatory Variables
- Exponential Smoothing
- Level of Granularity
- Long Short-Term Memory
- Loss Function
- Model Management
- Precision and Recall
- Predictive Learning
- ROC Curve
- Recommendation system
- Stochastic Gradient Descent
- Target Leakage
- Target Variable
- Underwriting

Gradient descent is an iterative optimization algorithm used to find a local minimum of a differentiable function, usually with the goal of minimizing prediction error. It is often used when the best parameter values can’t be calculated directly and must instead be discovered through repeated trial and correction.

**Coefficient** - A function’s parameter values; through iterations, they are reevaluated until the cost is as close to 0 as possible (or good enough).

**Cost** - The function being evaluated; gradient descent is used to find its minimum.

**Delta** - The derivative of the cost function.

**Random initialization** - The formulaic guessing process used to initialize gradient descent.

**Local minima** - Points where the derivative of the function is as close to 0 as is acceptable.

**True local minima** - Points where the derivative of the function is exactly 0.

**Learning rate** - A hyperparameter that sets the size of each update step and therefore controls how quickly a model adapts to a given problem.

**Iterations (batch)** - An indication of the number of times a gradient descent algorithm’s parameters are updated.
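The terms above can be tied together in a minimal sketch. The cost function `(x - 3)^2`, the starting range, and the names `cost`, `delta`, and `gradient_descent` are assumptions chosen for illustration, not part of any particular library.

```python
import random


def cost(x):
    """Cost function to minimize: a parabola whose minimum is at x = 3."""
    return (x - 3.0) ** 2


def delta(x):
    """Derivative of the cost function with respect to the coefficient x."""
    return 2.0 * (x - 3.0)


def gradient_descent(start, learning_rate=0.1, iterations=100, tolerance=1e-8):
    x = start  # the coefficient being tuned
    for _ in range(iterations):
        grad = delta(x)
        # Derivative close enough to 0: accept the point as a local minimum.
        if abs(grad) < tolerance:
            break
        # Step against the slope, scaled by the learning rate.
        x -= learning_rate * grad
    return x


random.seed(0)
x0 = random.uniform(-10.0, 10.0)  # random initialization (illustrative range)
x_min = gradient_descent(x0)
print(x_min, cost(x_min))
```

With this learning rate each step shrinks the distance to the minimum by a constant factor, so the loop usually stops on the tolerance check before the iteration limit is reached.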

Gradient descent estimates the error gradient within machine learning models. This helps minimize their cost function while keeping computation time low, so models and predictions can be delivered quickly. While there are other optimization algorithms with better convergence guarantees, few are as computationally efficient as gradient descent.

This efficiency enables gradient descent algorithms to train neural networks on large datasets with reasonable turnaround. Gradient descent is a simple, effective tool that proves useful for straightforward, quantitative neural network training.

Compared to gradient descent, stochastic gradient descent is much faster per iteration and more suitable for large datasets. The gradient is calculated not for the entire dataset but for only one randomly chosen point at each iteration, so each update is cheaper but the variance of the updates is higher.
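As a rough sketch of the idea, the following fits a line by updating on one randomly chosen point per iteration. The toy dataset on `y = 2x + 1`, the learning rate, and the function names are hypothetical choices made for this example.

```python
import random

random.seed(0)

# Hypothetical toy data lying exactly on the line y = 2x + 1.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-20, 21)]


def point_gradient(w, b, x, y):
    """Gradient of the squared error (w*x + b - y)^2 for a single data point."""
    err = w * x + b - y
    return 2.0 * err * x, 2.0 * err


def sgd(learning_rate=0.05, iterations=5000):
    w, b = 0.0, 0.0
    for _ in range(iterations):
        x, y = random.choice(data)           # one random point per iteration
        dw, db = point_gradient(w, b, x, y)  # gradient from that point only
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b


w, b = sgd()
print(w, b)
```

Each update looks at a single point instead of all 41, which is where the speed comes from. On noisy real-world data the individual updates would scatter around the true gradient direction, which is the higher variance described above.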

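Like gradient descent, Newton’s method can be used to find local minima. Where it differs is in its approach: Newton’s method is a root-finding procedure, so rather than searching for a minimum or maximum directly, it finds a root of a function. When applied to optimization, it looks for a root of the function’s derivative, which requires second-derivative information.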
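To make the contrast concrete, here is a small sketch of Newton’s method used for minimization by finding the root of a derivative. The function `f(x) = exp(x) - 2x` and the helper names are assumptions for illustration; its minimum sits at `x = ln 2`.

```python
import math


def f_prime(x):
    """First derivative of the illustrative function f(x) = exp(x) - 2x."""
    return math.exp(x) - 2.0


def f_second(x):
    """Second derivative of f; always positive, so any root of f_prime is a minimum."""
    return math.exp(x)


def newton_minimize(x, iterations=20, tolerance=1e-12):
    for _ in range(iterations):
        # Newton step for the root of f_prime: divide by the second derivative
        # instead of multiplying by a fixed learning rate.
        step = f_prime(x) / f_second(x)
        x -= step
        if abs(step) < tolerance:
            break
    return x


x_star = newton_minimize(0.0)
print(x_star, math.log(2.0))
```

The step size comes from the curvature of the function, so no learning rate is needed, but every iteration has to evaluate a second derivative.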

Gradient descent is an optimization algorithm for minimizing the error of a predictive model on a training dataset. Backpropagation is an automatic differentiation algorithm for calculating the gradients of the weights in a neural network’s graph structure. These two algorithms are used in tandem to effectively train neural network models.
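The division of labor can be sketched with a deliberately tiny network: one input, one tanh hidden unit, and one output. The architecture, the single training example, and all names here are hypothetical and chosen only to keep the chain rule visible.

```python
import math


def forward(w1, w2, x):
    h = math.tanh(w1 * x)  # hidden activation
    y_hat = w2 * h         # network output
    return h, y_hat


def backward(w1, w2, x, y):
    """Backpropagation: apply the chain rule from the loss back to each weight."""
    h, y_hat = forward(w1, w2, x)
    d_yhat = 2.0 * (y_hat - y)      # dL/dy_hat for the loss L = (y_hat - y)^2
    d_w2 = d_yhat * h               # dL/dw2
    d_h = d_yhat * w2               # dL/dh
    d_w1 = d_h * (1.0 - h * h) * x  # dL/dw1, using tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2


def train(x, y, w1=0.5, w2=0.5, learning_rate=0.1, iterations=500):
    for _ in range(iterations):
        d_w1, d_w2 = backward(w1, w2, x, y)  # backpropagation supplies the gradients
        w1 -= learning_rate * d_w1           # gradient descent applies the update
        w2 -= learning_rate * d_w2
    return w1, w2


w1, w2 = train(x=1.0, y=0.8)
_, prediction = forward(w1, w2, 1.0)
print(prediction)
```

`backward` only computes gradients and `train` only uses them, which mirrors how the two algorithms are kept separate in practice.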

Gradient boosting complements gradient descent. Specifically, boosting algorithms are iterative functional gradient descent algorithms: they optimize a cost function over function space by choosing, at each iteration, a weak hypothesis that points in the negative gradient direction.
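A minimal sketch of this view, assuming squared-error loss and one-split regression stumps as the weak hypotheses (both are illustrative choices, as are the toy data and function names). For squared error, the negative gradient with respect to the current prediction is simply the residual, so each round fits a stump to the residuals.

```python
def fit_stump(xs, targets):
    """Weak hypothesis: a one-split regression stump fit by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def gradient_boost(xs, ys, rounds=50, learning_rate=0.3):
    preds = [0.0] * len(xs)
    stumps = []
    history = [mse(preds, ys)]
    for _ in range(rounds):
        # Negative gradient of 0.5 * (y - F(x))^2 with respect to F(x) is y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Step in function space along the fitted weak hypothesis.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(mse(preds, ys))
    return stumps, history


xs = [i / 10.0 for i in range(20)]  # hypothetical 1-D inputs
ys = [x * x for x in xs]            # hypothetical targets
stumps, history = gradient_boost(xs, ys)
print(history[0], history[-1])
```

The update line has the same shape as an ordinary gradient descent step, except that the quantity being adjusted is the model's prediction function instead of a vector of coefficients.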

Gradient descent is used primarily to minimize prediction error by following the decreasing slope of a model’s cost function, while gradient ascent is used to maximize an objective by following an upward slope.
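The only mechanical difference is the sign of the update, as this small sketch suggests. The objective `5 - (x - 2)^2` and the function names are assumptions for illustration.

```python
def objective(x):
    """Illustrative function to maximize: a downward parabola peaking at x = 2."""
    return 5.0 - (x - 2.0) ** 2


def slope(x):
    """Derivative of the objective."""
    return -2.0 * (x - 2.0)


def gradient_ascent(x, learning_rate=0.1, iterations=100):
    for _ in range(iterations):
        x += learning_rate * slope(x)  # step WITH the slope (descent would subtract)
    return x


x_max = gradient_ascent(0.0)
print(x_max, objective(x_max))
```

Running gradient ascent on a function is equivalent to running gradient descent on its negation.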

Machine learning algorithms differ in their representations and coefficients, but each still requires an optimization process to find the coefficients that yield the best estimate of the end result. This takes multiple iterations, or “batches”, of gradient descent, with each batch evaluating a different set of coefficients. Batch gradient descent is the most common form of gradient descent used in machine learning.
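As a hedged illustration of batch gradient descent, the sketch below averages the gradient over every data point before each coefficient update. The toy dataset on `y = 2x + 1`, the learning rate, and the function names are hypothetical.

```python
# Hypothetical toy data lying exactly on the line y = 2x + 1.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-20, 21)]


def batch_gradient(w, b):
    """Gradient of the mean squared error, averaged over the whole dataset."""
    gw = gb = 0.0
    for x, y in data:
        err = w * x + b - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    return gw / len(data), gb / len(data)


def batch_gradient_descent(learning_rate=0.1, iterations=200):
    w, b = 0.0, 0.0  # coefficients to be optimized
    for _ in range(iterations):
        gw, gb = batch_gradient(w, b)  # one pass over all data per update
        w -= learning_rate * gw
        b -= learning_rate * gb
    return w, b


w_fit, b_fit = batch_gradient_descent()
print(w_fit, b_fit)
```

Every update uses the full dataset, so the steps are smooth and repeatable, at the price of one complete pass over the data per iteration.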