What is the F1 score?

The F1 score is a single metric that combines precision and recall by taking their harmonic mean, giving a balanced measure of a classifier's performance that penalises models which neglect either one.

Combining precision and recall

A classifier's precision and recall often trade off against each other, which makes comparing models awkward when one has higher precision and the other higher recall. The F1 score resolves this by combining them into one figure. It is the harmonic mean of the two, calculated as 2 × (precision × recall) ÷ (precision + recall), rather than a simple average, which is why a low value on either metric drags the whole score down.
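As a minimal sketch of the formula above, the following Python computes precision, recall and F1 from confusion-matrix counts. The function names and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something the definition itself prescribes.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from counts of true positives, false positives
    and false negatives. Undefined ratios are reported as 0.0 (assumed convention)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2 * P * R / (P + R)."""
    if precision + recall == 0:
        return 0.0  # both metrics are zero, so the score is taken to be zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
p, r = precision_recall(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} F1={f1_score(p, r):.3f}")
```

With these counts, precision is 0.8, recall is about 0.667 and F1 is about 0.727, which sits below the simple average of the two (about 0.733).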

Why the harmonic mean

The harmonic mean is used deliberately because it is dominated by the smaller of the two values. If either precision or recall is low, the F1 score is pulled towards that low value rather than splitting the difference; it can never exceed twice the smaller of the two.

For example, a model with precision 1.0 and recall 0.1 has an arithmetic mean of 0.55 but an F1 score of only about 0.18. (A recall of exactly 0 would mean no true positives at all, in which case precision could not be 1.0.) This ensures a high F1 score requires both metrics to be reasonably high, which is exactly what is wanted when neither false positives nor false negatives can be ignored.
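The contrast between the two means can be checked directly. The sketch below compares them on a few hypothetical precision/recall pairs; the helper names and the pairs are chosen for illustration only.

```python
def arithmetic_mean(p: float, r: float) -> float:
    """Simple average of precision and recall."""
    return (p + r) / 2


def harmonic_mean(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (the F1 score); 0.0 if both are zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical (precision, recall) pairs, from balanced to very lopsided.
pairs = [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1), (1.0, 0.01)]

for p, r in pairs:
    print(f"P={p:.2f} R={r:.2f}  "
          f"arithmetic={arithmetic_mean(p, r):.3f}  "
          f"F1={harmonic_mean(p, r):.3f}")
```

For the pair (0.9, 0.1), the arithmetic mean is 0.5 while the F1 score is 0.18: the more lopsided the pair, the wider the gap between the two summaries.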

Variants and limitations

The F1 score is a special case of the more general F-beta score, which treats recall as β times as important as precision: β greater than 1 favours recall, β less than 1 favours precision, and β = 1 gives F1. For multi-class problems, F1 can be averaged across classes in different ways, and the choice affects the result: "macro" averaging takes the unweighted mean of the per-class F1 scores, while "micro" averaging pools the true positives, false positives and false negatives across all classes before computing a single score. The F1 score also ignores true negatives, so it is not always the right summary; the appropriate metric still depends on the costs of each kind of error.
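The text does not spell out the F-beta formula. The sketch below assumes the standard definition, (1 + β²) × P × R ÷ (β² × P + R), and uses a hypothetical function name and hypothetical inputs to show how β shifts the weighting.

```python
def fbeta_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta under the standard definition (assumed here):
    (1 + beta^2) * P * R / (beta^2 * P + R).
    beta > 1 weights recall more heavily; beta < 1 weights precision more heavily."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # both metrics are zero
    return (1 + b2) * precision * recall / denom


# Hypothetical model whose recall is higher than its precision.
p, r = 0.5, 0.8
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={fbeta_score(p, r, beta):.3f}")
```

Because recall is the stronger metric in this example, the score rises as β increases: roughly 0.541 at β = 0.5, 0.615 at β = 1 and 0.714 at β = 2.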

The F1 score in research

The F1 score is a common headline metric for classification, especially on imbalanced datasets where accuracy is misleading. Reported responsibly, it is computed on held-out data and accompanied by the underlying precision and recall, since the single number hides the balance between them. For multi-class results, the averaging method should be stated, because macro and micro F1 can differ substantially when class sizes are uneven.
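To illustrate how far macro and micro F1 can diverge on uneven classes, here is a small standard-library sketch. It assumes the usual definitions (macro: unweighted mean of per-class F1; micro: F1 computed from counts pooled over all classes); the labels, data and function names are hypothetical.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 written in terms of counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    tp, fp, fn = per_class_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    macro = sum(f1_from_counts(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical imbalanced data: 90 examples of class "a", 10 of class "b".
# The model gets every "a" right but labels 8 of the 10 "b" examples as "a".
y_true = ["a"] * 90 + ["b"] * 10
y_pred = ["a"] * 98 + ["b"] * 2

macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro F1={macro:.3f}  micro F1={micro:.3f}")
```

Here micro F1 is 0.92, dominated by the large class, while macro F1 is about 0.645 because the poorly handled minority class counts equally. Reporting one without naming it would give a very different impression of the same model.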

Key facts

At a glance

Definition: harmonic mean of precision and recall
Formula: 2 × (precision × recall) ÷ (precision + recall)
Range: 0 to 1 (1 is perfect)
High only when both precision and recall are high
Generalises to the F-beta score (weighting recall vs precision)
Ignores true negatives

Common questions

FAQ

How is the F1 score calculated?

The F1 score is the harmonic mean of precision and recall: 2 × (precision × recall) ÷ (precision + recall). It ranges from 0 to 1, and is high only when both precision and recall are high.

Why use the harmonic mean instead of a simple average?

The harmonic mean is dominated by the smaller value, so a low precision or low recall pulls the F1 score down sharply. This prevents a model from scoring well by maximising one metric while neglecting the other.

When should the F1 score be used?

It is most useful for classification on imbalanced data, where accuracy is misleading and both false positives and false negatives matter. When one type of error is far more costly, a weighted F-beta score or the raw precision and recall may be more informative.
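A short sketch of why accuracy can mislead on imbalanced data, and of the fact that F1 takes no account of true negatives. The two confusion matrices are hypothetical, as are the function names.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1(tp: int, fp: int, fn: int, tn: int = 0) -> float:
    """F1 from counts. tn is accepted for symmetry but deliberately unused:
    true negatives do not enter the formula."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical dataset: 10 positives and 990 negatives.
always_negative = dict(tp=0, fp=0, fn=10, tn=990)  # never predicts positive
detector = dict(tp=6, fp=4, fn=4, tn=986)          # finds 6 of the 10 positives

for name, counts in [("always negative", always_negative), ("detector", detector)]:
    print(f"{name}: accuracy={accuracy(**counts):.3f}  F1={f1(**counts):.3f}")
```

Both models score about 0.99 on accuracy, yet the first never finds a positive and has an F1 of 0, while the second reaches 0.6. The flip side is that adding more true negatives leaves F1 unchanged, which is why it is a poor summary when correctly rejecting negatives matters.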
