
What is gradient descent?

Gradient descent is an iterative optimisation algorithm that minimises a function by repeatedly taking steps in the direction of steepest descent. It is widely used to train machine-learning models, where the function being minimised is the model's error.

The idea of stepping downhill

Gradient descent imagines the function being minimised as a landscape and seeks its lowest point. At the current position it computes the gradient, which points in the direction of steepest increase; it then takes a step in the opposite direction, downhill. Repeating this gradually reduces the function's value. In machine learning, the function is a loss that measures prediction error, and the position is the set of model parameters, so each step nudges the parameters toward lower error.
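As a minimal sketch of the loop described above, the following Python applies the downhill step to a one-dimensional function. The function f(x) = (x - 3)^2, the starting point, the learning rate of 0.1, and the name `gradient_descent` are all assumptions chosen for illustration, not part of any particular library.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        # The gradient points uphill, so subtract it to move downhill.
        x = x - learning_rate * grad(x)
    return x


# Illustrative function: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# Its minimum is at x = 3.
def grad_f(x):
    return 2.0 * (x - 3.0)


minimum = gradient_descent(grad_f, start=0.0)
print(minimum)  # a value very close to 3
```

In a machine-learning setting, `x` would be the vector of model parameters and `grad_f` the gradient of the loss, but the loop has the same shape.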

The learning rate

The size of each step is governed by the learning rate. Too large, and the algorithm can overshoot the minimum and fail to converge, oscillating or diverging; too small, and training is needlessly slow.
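The three regimes can be seen on a toy problem. The sketch below assumes the function f(x) = x^2, whose gradient is 2x, so each update multiplies x by (1 - 2 × learning rate). The specific rates (0.001, 0.3, 1.1) and the helper name `run` are illustrative choices only.

```python
def run(learning_rate, steps=20, start=1.0):
    """Gradient descent on f(x) = x**2 (gradient 2x); returns the final |x|."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x
    return abs(x)


too_small = run(0.001)  # each step multiplies x by 0.998: barely moves
reasonable = run(0.3)   # each step multiplies x by 0.4: converges quickly
too_large = run(1.1)    # each step multiplies x by -1.2: sign flips, magnitude grows

print(too_small, reasonable, too_large)
```

After the same 20 steps, the small rate has hardly left the starting point, the moderate rate is essentially at the minimum, and the large rate has moved further away than where it began.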

Choosing the learning rate is one of the most important practical decisions in training, and many refinements — such as schedules that shrink it over time, and adaptive optimisers — exist to manage it. The learning rate is a hyperparameter, set before training rather than learned from data.
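One simple schedule of this kind is inverse-time decay, sketched below under stated assumptions: the function names `inverse_time_decay` and `descend_with_schedule`, the initial rate of 0.4, and the decay constant of 0.05 are hypothetical values picked for illustration, and the function being minimised is again the toy f(x) = x^2. Adaptive optimisers work differently and are not shown here.

```python
def inverse_time_decay(initial_rate, decay, step):
    """Learning rate that shrinks as the step count grows."""
    return initial_rate / (1.0 + decay * step)


def descend_with_schedule(grad, start, initial_rate=0.4, decay=0.05, steps=50):
    """Gradient descent whose learning rate follows the schedule above."""
    x = start
    for step in range(steps):
        lr = inverse_time_decay(initial_rate, decay, step)  # 0.4 at step 0, about 0.12 at step 49
        x = x - lr * grad(x)
    return x


# Toy problem: f(x) = x**2, gradient 2x, minimum at 0.
result = descend_with_schedule(lambda x: 2.0 * x, start=1.0)
print(result)
```

Note that the initial rate and the decay constant are themselves hyperparameters, so a schedule changes what has to be tuned rather than removing the tuning.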

Variants of gradient descent

Computing the gradient over the entire dataset at each step (batch gradient descent) gives the exact gradient but is slow for large datasets. Stochastic gradient descent instead estimates the gradient from a single example, and mini-batch gradient descent from a small subset; the mini-batch form is the standard choice for training deep networks. These noisier estimates are far cheaper per step, and the noise often helps the optimiser escape poor solutions such as saddle points and shallow local minima. Gradient descent combined with backpropagation, which computes the gradients efficiently, is how neural networks are trained.
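The difference between the variants comes down to how many examples feed each gradient estimate. The sketch below fits a straight line by mini-batch gradient descent on mean squared error, using only the standard library. The synthetic data (y = 2x + 1), the function name `minibatch_sgd`, and all hyperparameter values are assumptions made up for illustration.

```python
import random


def minibatch_sgd(xs, ys, batch_size=8, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b by mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for i in range(0, len(indices), batch_size):
            batch = indices[i:i + batch_size]
            # Gradient of the mean squared error over this batch only.
            errors = [(w * xs[j] + b - ys[j], xs[j]) for j in batch]
            grad_w = sum(2.0 * e * x for e, x in errors) / len(batch)
            grad_b = sum(2.0 * e for e, _ in errors) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Synthetic, noise-free data from y = 2x + 1 (made up for illustration).
xs = [i / 10.0 for i in range(-20, 21)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = minibatch_sgd(xs, ys)
print(w, b)  # close to 2 and 1
```

In this sketch, setting `batch_size=1` gives stochastic gradient descent in the single-example sense, and setting `batch_size=len(xs)` gives batch gradient descent.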

Gradient descent in research

Gradient descent underlies the training of most modern machine-learning models, so its behaviour is itself a research topic — including convergence, the role of learning-rate schedules, and why it finds useful solutions in the very high-dimensional, non-convex loss surfaces of deep networks. For reproducibility, the optimiser, learning rate, batch size, and schedule are part of the method and should be reported, since they materially affect the model that results.
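One lightweight way to make these settings reportable is to keep them in a single record that is saved alongside the results. The sketch below is only a suggestion of what such a record might look like: every field name and value is a placeholder invented for illustration, not taken from any real experiment or required by any standard.

```python
import json

# Placeholder values for illustration only; not from a real experiment.
training_config = {
    "optimiser": "mini-batch SGD",
    "learning_rate": 0.05,
    "batch_size": 8,
    "schedule": {"type": "inverse_time_decay", "decay": 0.05},
    "epochs": 200,
    "random_seed": 0,
}

# Serialise with sorted keys so the record is stable across runs.
report = json.dumps(training_config, indent=2, sort_keys=True)
print(report)
```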

Key facts

At a glance

Definition: iterative optimisation by stepping down the gradient
Direction: opposite the gradient (steepest descent)
Step size: set by the learning rate
In ML: minimises a model's loss function
Variants: batch, stochastic, mini-batch
Neural networks: pairs with backpropagation for training

Common questions

FAQ

How does gradient descent work?

It starts from an initial set of parameters, computes the gradient of the loss with respect to them, and moves the parameters in the opposite (downhill) direction by a step set by the learning rate. Repeating this reduces the loss toward a minimum.

What is the learning rate?

The learning rate controls how large each step is. Too high and the algorithm may overshoot and fail to converge; too low and training is very slow. It is a hyperparameter set before training, and choosing it well is important for good results.

What is stochastic gradient descent?

Stochastic gradient descent estimates the gradient from a single example (or, in mini-batch form, a small subset) rather than the whole dataset. This makes each step much cheaper and is the standard approach for training large models such as deep neural networks.
