v2026.1714 entries · CC-BY 4.0
Dictionary term · Track C · Stable · v2026.2

RLHF (Reinforcement Learning from Human Feedback)

A training methodology in which a language model is fine-tuned using a reward signal derived from human preferences over pairs (or larger sets) of candidate model outputs, typically by first training a reward model and then optimising the policy against it via PPO or a related algorithm.

By CASRAI Editorial Board
· Last updated 21 May 2026
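
The reward-model step named in the definition can be illustrated with a pairwise preference loss. The sketch below assumes a Bradley-Terry style objective, which is a common choice for this step but is not something this entry specifies; the function names and the scalar reward inputs are hypothetical and stand in for the outputs of a learned reward model.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-probability that the human-preferred output wins.

    Assumes a Bradley-Terry model: P(chosen beats rejected) = sigmoid(margin),
    where margin is the difference between the two scalar rewards.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (reward_chosen, reward_rejected) pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)
```

The loss falls as the reward model scores the preferred output higher than the rejected one, and equals log 2 when the two scores are tied. Policy optimisation against the trained reward model (for example with PPO) is a separate stage and is not shown here.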

Examples

Worked examples

  • Is an instance

    An instruction-tuned model fine-tuned with 100k human preference comparisons over candidate responses.

  • Is an instance

    A DPO-trained model using preference data without an explicit reward-model step.
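
To show why the DPO example needs no explicit reward model, here is a minimal sketch of the per-pair DPO loss. It assumes the summed log-probabilities of each response under the trained policy and under a frozen reference model are already available; the argument names and the `beta` default are illustrative assumptions, not values taken from this entry.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative strength of the pull toward the reference
) -> float:
    """Per-pair Direct Preference Optimisation loss.

    The implicit reward of a response is beta times the log-ratio of policy
    to reference probability, so the preference data trains the policy directly.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy matches the reference the loss is log 2; it decreases as the policy raises the preferred response's probability relative to the reference more than it raises the rejected one's.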

Counter-examples

Looks similar, but isn't

  • Not an instance

    Pure supervised fine-tuning on labelled instruction data (SFT, not RLHF).

  • Not an instance

    Pre-training next-token prediction on web text.

Editorial commentary

RLHF (Christiano et al., 2017; Ouyang et al., 2022) became the dominant alignment technique for chat-tuned LLMs from 2022 onward. Variants and successors include DPO (Direct Preference Optimisation), IPO, KTO, and RLAIF (RL from AI Feedback, as in Constitutional AI). When RLHF is applied to a base model, it is recorded in that model's fine-tune lineage metadata.

References

  • Christiano et al., 'Deep reinforcement learning from human preferences' (NeurIPS 2017).

  • Ouyang et al., 'Training language models to follow instructions with human feedback' (NeurIPS 2022).

Also known as

RLHF

Machine-readable encodings

Use in your systems

JATS XML <role> element
xml
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="RLHF (Reinforcement Learning from Human Feedback)"
      vocab-term-identifier="https://casrai.org/dictionary/term/rlhf-reinforcement-learning-from-human-feedback" />
Schema.org DefinedTerm (JSON-LD)
json
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "RLHF (Reinforcement Learning from Human Feedback)",
  "identifier": "https://casrai.org/dictionary/term/rlhf-reinforcement-learning-from-human-feedback",
  "description": "A training methodology in which a language model is fine-tuned using a reward signal derived from human preferences over pairs (or larger sets) of candidate model outputs, typically by first training a reward model and then optimising the policy against it via PPO or a related algorithm.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
  "url": "https://casrai.org/dictionary/term/rlhf-reinforcement-learning-from-human-feedback",
  "alternateName": "RLHF",
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
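
A consuming system might read a record like the JSON-LD above roughly as follows. This is a sketch using only the Python standard library; the helper name `load_defined_term`, the returned key names, and the abbreviated `EXAMPLE_RECORD` are assumptions made for illustration, not part of any CASRAI tooling.

```python
import json


def load_defined_term(raw_jsonld: str) -> dict:
    """Parse a Schema.org DefinedTerm record and keep the fields used for linking."""
    record = json.loads(raw_jsonld)
    if record.get("@type") != "DefinedTerm":
        raise ValueError(f"expected a DefinedTerm, got {record.get('@type')!r}")
    return {
        "name": record["name"],
        "identifier": record["identifier"],
        # Optional in this sketch: the term set the entry belongs to.
        "term_set": record.get("inDefinedTermSet"),
    }


# Abbreviated copy of the record shown above (description and licence omitted).
EXAMPLE_RECORD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "RLHF (Reinforcement Learning from Human Feedback)",
  "identifier": "https://casrai.org/dictionary/term/rlhf-reinforcement-learning-from-human-feedback",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/"
}
"""

term = load_defined_term(EXAMPLE_RECORD)
print(term["identifier"])
```

Storing the `identifier` URL rather than the display name keeps references stable if the term's label is later reworded.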
