A training methodology in which a language model is fine-tuned using a reward signal derived from human preferences over pairs (or larger sets) of candidate model outputs, typically by first training a reward model and then optimising the policy against it via PPO or a related algorithm.

By CASRAI Editorial Board · Last updated 21 May 2026

Examples

Worked examples

Is an instance
An instruction-tuned model fine-tuned with 100k human preference comparisons over candidate responses.
Is an instance
A DPO-trained model using preference data without an explicit reward-model step.
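
The reward-model step mentioned in the definition is commonly fitted with a pairwise (Bradley-Terry) loss over preference comparisons like those in the first example. The following is a minimal sketch under that assumption, not a description of any particular system: it takes scalar scores that a reward model has already produced, and the function names are illustrative only.

```python
import math


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one comparison:
    -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow
    # when the margin is large in magnitude.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_pairwise_loss(score_pairs) -> float:
    """Average the loss over (chosen_score, rejected_score) pairs."""
    pairs = list(score_pairs)
    return sum(pairwise_reward_loss(c, r) for c, r in pairs) / len(pairs)
```

The loss is small when the reward model scores the human-preferred response well above the rejected one, and grows roughly linearly when the ordering is reversed.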

Counter-examples

Looks similar, but isn't

Not an instance
Pure supervised fine-tuning on labelled instruction data (SFT, not RLHF).
Not an instance
Pre-training with next-token prediction on web text.

Editorial commentary

RLHF (Christiano et al., 2017; Ouyang et al., 2022) became the dominant alignment technique for chat-tuned LLMs from 2022 onward. Variants and successors include DPO (Direct Preference Optimisation), IPO (Identity Preference Optimisation), KTO (Kahneman-Tversky Optimisation), and RLAIF (RL from AI Feedback, as in Constitutional AI). When RLHF is applied to a base model, it is recorded in the resulting model's fine-tune lineage metadata.
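
Of the variants listed above, DPO replaces the explicit reward model with a loss computed directly from policy and reference log-probabilities. The following is a minimal per-pair sketch, assuming the summed log-probabilities of each response are already available; the function name, argument names, and the default `beta` are illustrative choices, not values taken from this entry.

```python
import math


def dpo_pair_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative strength of the pull towards the reference
) -> float:
    """DPO loss for one preference pair:
    -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy assigns the same log-probabilities as the reference model, the loss equals log 2. It falls as the policy shifts probability towards the preferred response relative to the reference.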

References

Christiano et al., 'Deep reinforcement learning from human preferences' (NeurIPS 2017); Ouyang et al., 'Training language models to follow instructions with human feedback' (NeurIPS 2022).

Also known as

RLHF

Machine-readable encodings

Use in your systems

JATS XML <role> element

xml

<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="RLHF (Reinforcement Learning from Human Feedback)"
      vocab-term-identifier="https://casrai.org/dictionary/term/rlhf-reinforcement-learning-from-human-feedback" />

Schema.org DefinedTerm (JSON-LD)

json

{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "@id": "https://casrai.org/dictionary/term/rlhf-reinforcement-learning-from-human-feedback",
  "name": "RLHF (Reinforcement Learning from Human Feedback)",
  "identifier": "https://casrai.org/dictionary/term/rlhf-reinforcement-learning-from-human-feedback",
  "description": "A training methodology in which a language model is fine-tuned using a reward signal derived from human preferences over pairs (or larger sets) of candidate model outputs, typically by first training a reward model and then optimising the policy against it via PPO or a related algorithm.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-ml-research-outputs#set",
  "url": "https://casrai.org/dictionary/term/rlhf-reinforcement-learning-from-human-feedback",
  "alternateName": [
    "RLHF"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "publisher": {
    "@id": "https://casrai.org/#organization"
  },
  "dateModified": "2026-05-21T02:22:51",
  "inLanguage": "en"
}
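
A consuming system might read a record like the one above along these lines. This is a sketch only: the embedded record is an abbreviated copy of the JSON-LD shown, and `summarise_term` is a hypothetical helper, not part of any CASRAI tooling.

```python
import json

# Abbreviated copy of the JSON-LD record above, embedded for illustration.
RECORD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "@id": "https://casrai.org/dictionary/term/rlhf-reinforcement-learning-from-human-feedback",
  "name": "RLHF (Reinforcement Learning from Human Feedback)",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-ml-research-outputs#set",
  "inLanguage": "en"
}
"""


def summarise_term(raw: str) -> dict:
    """Parse a DefinedTerm record and keep the fields needed to cite the term."""
    doc = json.loads(raw)
    if doc.get("@type") != "DefinedTerm":
        raise ValueError("expected a schema.org DefinedTerm record")
    return {
        "id": doc["@id"],
        "name": doc["name"],
        "term_set": doc["inDefinedTermSet"],
    }


if __name__ == "__main__":
    print(summarise_term(RECORD))
```

Keying on `@id` rather than `name` keeps references stable if the display label is later reworded.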
