What is reinforcement learning?
Reinforcement learning is the branch of machine learning in which an agent learns to make decisions by interacting with an environment, choosing actions that maximise a cumulative reward signal through trial and error.
Agent, environment and reward
Reinforcement learning is framed around an agent that interacts with an environment. At each step the agent observes a state, chooses an action, and receives a numerical reward together with a new state. The agent's objective is to maximise the cumulative reward over time, not the immediate reward alone. The rule it follows for choosing actions is its policy. Unlike supervised learning, there are no labelled correct actions; the agent must discover good behaviour from the reward signal it receives.
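The observe, act, receive-reward cycle described above can be sketched as a short loop. This is a minimal illustration only: `CorridorEnv`, `random_policy` and `run_episode` are hypothetical names invented for this example, and the toy corridor environment is an assumption, not something a particular library provides.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells; reaching the right end pays 1."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # Start every episode in the left-most cell.
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the corridor.
        self.state = max(0, min(self.length - 1, self.state + action))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    # A policy maps a state to an action; this one ignores the state entirely.
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=100):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state, rng)             # agent chooses an action
        state, reward, done = env.step(action)  # environment returns reward and new state
        total_reward += reward                  # the objective is the cumulative reward
        if done:
            break
    return total_reward


rng = random.Random(0)
episode_return = run_episode(CorridorEnv(), random_policy, rng)
```

Swapping `random_policy` for a better rule changes the behaviour without touching the loop, which is why the policy is treated as the thing being learned.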
Exploration, exploitation and delayed reward
Two features distinguish reinforcement learning. First, the exploration–exploitation trade-off: the agent must exploit actions it knows to be good while still exploring others that might be better.
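One simple way to balance the two is the epsilon-greedy rule: with a small probability the agent tries a random action, and otherwise it takes the action with the best estimated reward so far. The sketch below applies it to a multi-armed bandit. The function name, the Gaussian reward noise, and the values of `epsilon`, `steps` and `true_means` are all assumptions chosen for illustration.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=2000, seed=0):
    """Run epsilon-greedy on a simulated bandit and return (estimates, counts)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick any arm at random
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit best estimate
        # Assumed reward model: the arm's true mean plus unit-variance Gaussian noise.
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```

Setting `epsilon` to 0 gives a purely greedy agent that can lock onto an early lucky arm, while setting it to 1 gives an agent that never uses what it has learned.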
Second, rewards can be delayed: an action taken now may pay off only much later, which makes it hard to assign credit to the decisions that actually earned the reward (the credit assignment problem). Both are central challenges of the field.
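A common way to let a late reward reach the earlier decisions is to score each step by the discounted sum of the rewards that follow it. The sketch below assumes a discount factor `gamma`, which the text above does not specify; both the function name and the example values are illustrative.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return, for each step t, the sum of rewards[t:] with step k ahead weighted by gamma**k."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the return already computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A single reward at the final step still gives earlier steps a non-zero return.
returns = discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5)
# returns == [0.125, 0.25, 0.5, 1.0]
```

Steps closer to the reward receive more of it, which is one simple heuristic for credit assignment rather than a full solution to it.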
Where reinforcement learning is used
Reinforcement learning suits problems framed as sequential decisions, such as game playing, robotics control, recommendation, and resource scheduling. It drew wide attention through systems that reached or exceeded human performance at complex games. More recently, reinforcement learning from human feedback (RLHF) has been used to align large language models with human preferences. Reinforcement learning is usually counted as one of the three main paradigms of machine learning, alongside supervised and unsupervised learning.
Reinforcement learning in research
In research, reinforcement learning is studied both as a model of learning and as a practical method for control and decision problems. Reproducibility is a known difficulty: results can be highly sensitive to random seeds, reward design, and hyperparameters, and poorly specified rewards can produce unintended behaviour ("reward hacking"). Sound practice reports the environment, reward function, and training details precisely, and evaluates across multiple seeds rather than a single fortunate run.
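Evaluating across seeds can be as simple as repeating the run and reporting the spread alongside the mean. In the sketch below, `train_and_evaluate` is a hypothetical stand-in that returns a simulated score, not a real training run; in practice it would be replaced by the actual experiment, seeded with the given value.

```python
import random
import statistics


def train_and_evaluate(seed):
    """Hypothetical placeholder for a full training run; returns a final score."""
    rng = random.Random(seed)
    # Simulated, seed-dependent score. This is NOT real training output.
    return 100.0 + rng.gauss(0.0, 15.0)


def summarise_over_seeds(run_fn, seeds):
    """Run the experiment once per seed and report every score plus summary statistics."""
    seeds = list(seeds)
    scores = [run_fn(seed) for seed in seeds]
    return {
        "seeds": seeds,
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation; needs at least 2 seeds
        "min": min(scores),
        "max": max(scores),
    }


summary = summarise_over_seeds(train_and_evaluate, seeds=range(5))
```

Reporting the per-seed scores, and not only the mean, lets a reader see whether a headline number comes from one unusually good run.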
Key facts
At a glance
- Field: a paradigm of machine learning, alongside supervised and unsupervised learning
- Core idea: learn from reward and penalty through interaction
- Key elements: agent, environment, state, action, reward
- Strategy learned: a policy
- Central trade-off: exploration vs exploitation
- Challenge: delayed reward and credit assignment
Common questions
FAQ
How does reinforcement learning differ from supervised learning?
Supervised learning trains on labelled examples with known correct outputs. Reinforcement learning has no labelled answers; an agent learns by taking actions and receiving rewards, discovering good behaviour through trial and error over time.
What is the exploration–exploitation trade-off?
It is the tension between exploiting actions already known to give good rewards and exploring new actions that might be even better. Balancing the two is essential: too much exploitation can miss superior strategies, while too much exploration gives up reward the agent already knows how to earn.
What is a policy in reinforcement learning?
A policy is the agent's strategy for choosing an action in each state. The goal of reinforcement learning is to find a policy that maximises the cumulative reward the agent receives over time.