What is a random forest?

A random forest is a machine-learning model that combines the predictions of many decision trees, each trained on a different random sample of the data and limited to a random subset of features at each split, to produce results that are typically more accurate and more stable than those of a single tree.

An ensemble of trees

A random forest is an ensemble: it builds many decision trees and aggregates their outputs rather than relying on one. For classification it takes a majority vote across the trees; for regression it averages their predictions. The guiding idea is that many diverse, individually imperfect models, combined, tend to be more accurate and more stable than any single one, provided the models make different kinds of error, so that their mistakes tend to cancel out rather than reinforce one another.
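The two aggregation rules can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function names and the example predictions are hypothetical, and ties in the vote are broken by first-seen order here, which real libraries may handle differently.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the label predicted by the most trees (classification)."""
    # Counter.most_common lists tied labels in first-seen order.
    return Counter(labels).most_common(1)[0][0]


def average_prediction(values):
    """Return the mean of the trees' numeric predictions (regression)."""
    return mean(values)


# Hypothetical per-tree outputs for one input row.
tree_votes = ["spam", "ham", "spam", "spam", "ham"]
tree_estimates = [3.0, 2.5, 3.5, 3.0]

forest_label = majority_vote(tree_votes)
forest_value = average_prediction(tree_estimates)
```

Three of the five hypothetical trees vote "spam", so the forest predicts "spam" even though two trees disagree.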

Bagging and feature randomness

Two sources of randomness make the trees diverse. The first is bagging (bootstrap aggregating): each tree is trained on a random sample of the data drawn with replacement, so every tree sees a slightly different dataset.
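As a rough sketch of what a bootstrap sample looks like, the snippet below draws row indices with replacement using only the standard library. The helper name, the seed, and the dataset size are arbitrary choices made for illustration.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices uniformly at random, with replacement."""
    return rng.choices(range(n_rows), k=n_rows)


rng = random.Random(42)  # arbitrary fixed seed so the draw is repeatable
n_rows = 10_000
sample = bootstrap_indices(n_rows, rng)

in_bag = set(sample)                       # rows this tree would train on
out_of_bag = set(range(n_rows)) - in_bag   # rows this tree would never see
unique_fraction = len(in_bag) / n_rows     # near 1 - 1/e (about 0.632) for large n_rows
```

Because sampling is with replacement, some rows appear several times and roughly a third are left out of each sample. Those left-out rows are usually called out-of-bag rows.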

The second is feature randomness: at each split, a tree considers only a random subset of features rather than all of them. Together these decorrelate the trees, which is what allows averaging to cut variance and reduce overfitting.
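The link between decorrelation and variance can be made explicit with a standard identity. As a simplifying assumption, suppose the forest has B trees whose predictions T_b(x) each have variance σ² and share a pairwise correlation ρ (these symbols are introduced here for illustration). The variance of the averaged prediction is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Adding trees shrinks only the second term. The first term, ρσ², is a floor that falls only when the trees become less correlated, which is the job of bagging and feature randomness.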

Strengths and trade-offs

Random forests are accurate, robust, and require relatively little tuning, which makes them a strong default for many tabular-data problems. They resist overfitting better than single trees and can rank the importance of features. The trade-offs are reduced interpretability — a forest of hundreds of trees cannot be read like one tree — and greater computational cost. Random forests, introduced by Leo Breiman in 2001, remain one of the most widely used and dependable supervised learning methods.

Random forests in research

In research, random forests are a popular, dependable baseline for prediction on structured data, often performing well with little tuning. Their feature-importance estimates can suggest which variables matter, but they must be interpreted cautiously: impurity-based importances tend to favour high-cardinality features, and correlated features can split or mask each other's importance. As with any model, performance should be assessed on held-out data, for example via cross-validation, and the random seed and settings should be reported for reproducibility.
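A seeded cross-validation split might look like the sketch below. It is an illustrative, standard-library version with hypothetical function names. In practice a library routine would normally be used, but the principle is the same: a fixed seed makes the folds reproducible, and every row is held out exactly once.

```python
import random


def k_fold_indices(n_rows, k, seed):
    """Shuffle row indices with a fixed seed and deal them into k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_rows, k, seed):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    folds = k_fold_indices(n_rows, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


# The seed and k would be reported alongside the results.
splits = list(train_test_splits(n_rows=10, k=5, seed=0))
```

A model would be fitted on each `train` list and scored on the matching `test` list, and the scores averaged across the folds.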

Key facts

At a glance

Definition: ensemble of many decision trees
Combination: majority vote (classification) or average (regression)
Bagging: each tree trained on a bootstrap sample
Feature randomness: random feature subset at each split
Benefit: reduces variance and overfitting
Introduced: Leo Breiman, 2001

Common questions

FAQ

How does a random forest work?

It trains many decision trees, each on a random sample of the data and considering a random subset of features at each split, then combines their predictions by voting or averaging. The diversity among trees makes the combined result more accurate and stable.
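The whole procedure can be sketched end to end with a deliberately tiny forest. As a simplifying assumption, each "tree" below is a one-split stump, so the random feature subset is drawn once per tree; a full random forest grows deeper trees and redraws the subset at every split. The data, function names, and parameters are all hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_ids):
    """Find the single split among feature_ids with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: predict the majority label everywhere
        maj = Counter(labels).most_common(1)[0][0]
        return (feature_ids[0], 0.0, maj, maj)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, n_sub=2, seed=0):
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = rng.choices(range(n), k=n)     # bagging: bootstrap sample of rows
        feats = rng.sample(range(d), n_sub)  # feature randomness for this split
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Combine the stumps' predictions by majority vote."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Toy data: features 0 and 1 separate the classes, feature 2 is uninformative.
rows = [
    [0.1, 0.2, 0.5], [0.2, 0.1, 0.9], [0.3, 0.3, 0.1], [0.0, 0.4, 0.7],
    [0.9, 0.8, 0.5], [0.8, 0.9, 0.9], [0.7, 0.7, 0.1], [1.0, 0.6, 0.7],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
```

Calling `predict(forest, [0.0, 0.0, 0.5])` and `predict(forest, [1.0, 1.0, 0.5])` then returns the forest's vote for a point near each class.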

Why is a random forest better than a single decision tree?

A single tree easily overfits and is unstable. By averaging many decorrelated trees, a random forest reduces variance, generalises better, and is more robust — at the cost of the single tree's interpretability and some extra computation.

What is bagging?

Bagging, or bootstrap aggregating, trains each model on a random sample of the data drawn with replacement and combines their predictions. In a random forest it is one of the two sources of randomness, alongside selecting a random subset of features at each split.
