v2026.1714 entries · CC-BY 4.0
Dictionary term · Track C · Stable · v2026.2

Mixture-of-experts (MoE)

A neural-network architecture in which a learned router directs each input (or token) to a small subset of specialist sub-networks ('experts'), so that the model has a large total parameter count but uses only a fraction per forward pass.

By CASRAI Editorial Board
· Last updated 21 May 2026

Examples

Worked examples

  • Is an instance

    A model with 8 experts of 7B parameters each, routing 2 experts per token (about 14B of 56B expert parameters active per token, before counting shared layers).

  • Is an instance

    A trillion-parameter MoE deployed with single-digit-billion active parameters per token.
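
The arithmetic behind the worked examples can be sketched as follows. This is a simplified illustration, not a reporting standard: the function name `moe_param_counts` and its `shared_params` argument are hypothetical, and it assumes equally sized experts plus an optional block of parameters (attention, embeddings, router) that every token uses.

```python
def moe_param_counts(n_experts, params_per_expert, top_k, shared_params=0):
    """Return (total, active_per_token) parameter counts for a simplified MoE.

    Assumes all experts are the same size and that `shared_params`
    are used by every token regardless of routing.
    """
    if not 1 <= top_k <= n_experts:
        raise ValueError("top_k must be between 1 and n_experts")
    total = shared_params + n_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


B = 10**9  # one billion

# 8 experts of 7B parameters each, 2 experts routed per token.
total, active = moe_param_counts(n_experts=8, params_per_expert=7 * B, top_k=2)
print(f"total={total / B:.0f}B active={active / B:.0f}B share={active / total:.0%}")
# prints: total=56B active=14B share=25%
```

Counting expert parameters only, two 7B experts give 14B active parameters. Published models usually share some layers across experts, so their reported totals and active counts differ from this simple product.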

Counter-examples

Looks similar, but isn't

  • Not an instance

    A dense transformer with every parameter used for every token.

  • Not an instance

    An ensemble of independently trained models combined by averaging.

Editorial commentary

MoE models decouple total parameter count from per-token compute, which allows very large models at manageable inference cost. Shazeer et al. (2017) reintroduced sparse MoE for deep learning. Later open and closed models such as GLaM, Mixtral and Gemini 1.5 use it, and GPT-4 is widely reported, though not officially confirmed, to do so, which makes MoE central to frontier-model engineering. Reporting practice distinguishes 'total parameters' from 'active parameters per token'.
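
A minimal sketch of the routing step may help show why per-token compute stays small. It assumes one common variant (select the top-k router scores, then softmax over only those scores) and uses random placeholder router weights and toy experts rather than trained ones. All names here (`top_k_route`, `moe_layer`, `make_expert`) are hypothetical and come from no particular library.

```python
import math
import random


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


def moe_layer(token, router_weights, experts, k=2):
    """Run one token through a sparse MoE layer; only k experts are evaluated."""
    # Router: one linear score per expert (dot product with the token).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    output = [0.0] * len(token)
    for index, gate in top_k_route(logits, k):
        expert_out = experts[index](token)  # unselected experts are never called
        output = [o + gate * y for o, y in zip(output, expert_out)]
    return output


def make_expert(scale, calls, index):
    """Toy expert that scales its input and records that it was called."""
    def expert(token):
        calls.append(index)
        return [scale * x for x in token]
    return expert


random.seed(0)
n_experts, dim = 8, 4
calls = []
experts = [make_expert(float(i + 1), calls, i) for i in range(n_experts)]
# Placeholder router weights; in a real model these are learned.
router_weights = [[random.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(n_experts)]

out = moe_layer([1.0, 0.5, -0.5, 2.0], router_weights, experts, k=2)
print(len(calls), "of", n_experts, "experts evaluated")
# prints: 2 of 8 experts evaluated
```

All eight experts count toward total parameters, but only the two selected ones run for this token, which is the distinction between 'total' and 'active' parameters.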

References

  • Shazeer et al., 'Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer' (ICLR 2017).

  • Fedus, Zoph, Shazeer, 'Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity' (JMLR 2022).

Also known as

MoE · sparse mixture-of-experts

Machine-readable encodings

Use in your systems

JATS XML <role> element
xml
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Mixture-of-experts (MoE)"
      vocab-term-identifier="https://casrai.org/dictionary/term/mixture-of-experts-moe" />
Schema.org DefinedTerm (JSON-LD)
json
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Mixture-of-experts (MoE)",
  "identifier": "https://casrai.org/dictionary/term/mixture-of-experts-moe",
  "description": "A neural-network architecture in which a learned router directs each input (or token) to a small subset of specialist sub-networks ('experts'), so that the model has a large total parameter count but uses only a fraction per forward pass.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
  "url": "https://casrai.org/dictionary/term/mixture-of-experts-moe",
  "alternateName": [
    "MoE",
    "sparse mixture-of-experts"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
