What is reinforcement learning?

Reinforcement learning is the branch of machine learning in which an agent learns to make decisions by interacting with an environment, choosing actions that maximise a cumulative reward signal through trial and error.

Agent, environment and reward

Reinforcement learning is framed around an agent that interacts with an environment. At each step the agent observes a state, chooses an action, and receives a numerical reward together with a new state. The agent's objective is to maximise the cumulative reward over time, not the immediate reward alone. The rule it follows for choosing actions is its policy. Unlike supervised learning, there are no labelled correct actions; the agent must discover good behaviour from the reward signal it receives.
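The observe, act, receive-reward cycle described above can be sketched as a short loop. This is a minimal illustration only: the `Corridor` environment, its `reset`/`step` interface, and the `always_right` policy are hypothetical names invented for this example, not part of any particular library.

```python
class Corridor:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.pos = 0
        self.steps = 0
        return self.pos

    def step(self, action):
        """Apply an action (1 = right, anything else = left)."""
        move = 1 if action == 1 else -1
        # Walls at both ends keep the agent inside the corridor.
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        self.steps += 1
        at_goal = self.pos == self.length - 1
        reward = 1.0 if at_goal else 0.0
        done = at_goal or self.steps >= self.max_steps
        return self.pos, reward, done


def always_right(state):
    """A fixed policy: choose action 1 (right) in every state."""
    return 1


def run_episode(env, policy):
    """Run one episode and return (cumulative reward, number of steps)."""
    state = env.reset()
    total_reward = 0.0
    steps = 0
    done = False
    while not done:
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward                  # accumulate reward over time
        steps += 1
    return total_reward, steps


print(run_episode(Corridor(), always_right))  # (1.0, 4)
```

The policy here is hand-written rather than learned; a learning algorithm would adjust it in response to the rewards collected in the loop.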

Exploration, exploitation and delayed reward

Two features distinguish reinforcement learning. First, the exploration–exploitation trade-off: the agent must exploit actions it knows to be good while still exploring others that might be better.
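One common way to balance the two, offered here as an illustrative sketch rather than the only approach, is an epsilon-greedy rule on a multi-armed bandit: with a small probability the agent tries a random action, otherwise it picks the action with the best reward estimate so far. The function name, the Gaussian reward noise, and the default parameter values are assumptions made for this example.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Run epsilon-greedy on a bandit whose arms pay Gaussian rewards.

    true_means is hidden from the agent; it only sees sampled rewards.
    Returns (reward estimates, pull counts, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try any arm
        else:
            # exploit: pick the arm with the highest current estimate
            arm = max(range(n_arms), key=lambda a: estimates[a])

        reward = rng.gauss(true_means[arm], 1.0)  # assumed noise model
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward


estimates, counts, total = epsilon_greedy_bandit([0.0, 1.0, 5.0])
print(counts)  # the pull counts show where the agent spent its time
```

Setting `epsilon` higher makes the agent explore more and earn less from what it already knows; setting it to zero removes exploration entirely, so the agent can lock onto an inferior arm.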

Second, rewards can be delayed: an action taken now may pay off only much later, which makes it hard to work out which earlier decisions deserve the credit (the credit assignment problem). Balancing exploration against exploitation and handling delayed reward are the field's central challenges.
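A standard device for spreading a late reward back over the earlier steps is the discounted return, in which each step is credited with its own reward plus a discounted share of everything that follows. The sketch below assumes a discount factor `gamma`, a parameter the text above does not mention; the function name is likewise illustrative.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Only the final step is rewarded, yet every earlier step receives some credit.
print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5))
# [0.125, 0.25, 0.5, 1.0]
```

Steps closer to the reward receive more of the credit; a `gamma` nearer 1 makes the agent more far-sighted.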

Where reinforcement learning is used

Reinforcement learning suits problems that can be framed as sequential decisions: game playing, robotic control, recommendation, and resource scheduling. It drew wide attention through systems that reached or exceeded human performance at complex games such as Go and Atari video games. More recently, reinforcement learning from human feedback (RLHF) has been used to align large language models with human preferences. Reinforcement learning is usually counted as one of the three main paradigms of machine learning, alongside supervised and unsupervised learning.

Reinforcement learning in research

In research, reinforcement learning is studied both as a model of learning and as a practical method for control and decision problems. Reproducibility is a known difficulty: results can be highly sensitive to random seeds, reward design, and hyperparameters, and poorly specified rewards can produce unintended behaviour ("reward hacking"). Sound practice reports the environment, reward function, and training details precisely, and evaluates across multiple seeds rather than a single fortunate run.
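Reporting across seeds can be as simple as running the same experiment several times and summarising the spread. In the sketch below, `train_and_evaluate` is a hypothetical stand-in that returns a noisy score; in practice it would be replaced by a full training and evaluation run. The function names and the summary fields are assumptions for illustration.

```python
import random
import statistics


def train_and_evaluate(seed):
    """Hypothetical stand-in for a training run; returns a final score."""
    rng = random.Random(seed)
    return 100.0 + rng.gauss(0.0, 15.0)  # placeholder for a real result


def evaluate_across_seeds(run_fn, seeds):
    """Run run_fn once per seed and summarise the scores."""
    seeds = list(seeds)
    scores = [run_fn(seed) for seed in seeds]
    return {
        "seeds": seeds,
        "scores": scores,
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


report = evaluate_across_seeds(train_and_evaluate, range(5))
print(f"mean={report['mean']:.1f} stdev={report['stdev']:.1f} "
      f"range=[{report['min']:.1f}, {report['max']:.1f}]")
```

Publishing the seed list alongside the mean and spread lets others rerun exactly the same set of experiments.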

Key facts

At a glance

Field: subfield of machine learning
Core idea: learn from reward and penalty through interaction
Key elements: agent, environment, state, action, reward
Strategy learned: a policy
Central trade-off: exploration vs exploitation
Challenge: delayed reward and credit assignment

Common questions

FAQ

How does reinforcement learning differ from supervised learning?

Supervised learning trains on labelled examples with known correct outputs. Reinforcement learning has no labelled answers; an agent learns by taking actions and receiving rewards, discovering good behaviour through trial and error over time.

What is the exploration–exploitation trade-off?

It is the tension between exploiting actions already known to give good rewards and exploring new actions that might be even better. Balancing the two is essential because over-exploiting can miss superior strategies, while over-exploring wastes opportunities.

What is a policy in reinforcement learning?

A policy is the agent's strategy for choosing an action in each state. The goal of reinforcement learning is to find a policy that maximises the cumulative reward the agent receives over time.
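As one illustration of how a policy can be found, the sketch below uses tabular Q-learning, a widely used algorithm chosen here purely as an example, on a hypothetical five-cell corridor with a reward at the right-hand end. The environment, function name, and all parameter values (`alpha`, `gamma`, `epsilon`) are assumptions made for this example.

```python
import random


def q_learning_corridor(n_states=5, episodes=500, alpha=0.5,
                        gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values for a corridor and derive a greedy policy.

    Actions: 0 = left, 1 = right. Reaching the last cell pays 1.0 and
    ends the episode; every other step pays 0.0.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]

    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore
            else:
                # exploit, breaking ties at random
                best = max(q[state])
                action = rng.choice([a for a in (0, 1) if q[state][a] == best])

            next_state = min(max(state + (1 if action == 1 else -1), 0), goal)
            reward = 1.0 if next_state == goal else 0.0

            # Q-learning target: reward plus discounted best next value
            # (no future value once the goal, a terminal state, is reached).
            future = 0.0 if next_state == goal else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])

            state = next_state
            if state == goal:
                break

    # The policy maps each non-terminal state to its highest-valued action.
    policy = [max((0, 1), key=lambda a: q[s][a]) for s in range(goal)]
    return q, policy


q, policy = q_learning_corridor()
print(policy)  # one action per non-terminal state
```

The learned table `q` is not itself the policy; the policy is the rule "in each state, take the action with the highest value", which is what the last step extracts.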
