A neural-network architecture in which a learned router directs each input (or token) to a small subset of specialist sub-networks ('experts'), so that the model has a large total parameter count but uses only a fraction per forward pass.

By CASRAI Editorial Board · Last updated 21 May 2026

Examples

Worked examples

Is an instance
A model with 8 experts of 7B parameters each, routing 2 experts per token (about 14B of 56B expert parameters active per token).
Is an instance
A trillion-parameter MoE deployed with single-digit-billion active parameters per token.
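The total-versus-active arithmetic behind these examples can be sketched as follows. This is a simplified accounting, not a description of any particular model: the function name `moe_param_counts` and the `shared_params` argument are illustrative assumptions, and real models also carry shared attention and embedding weights, so published figures will differ from this idealised count.

```python
def moe_param_counts(num_experts, experts_per_token, params_per_expert, shared_params=0):
    """Return (total, active_per_token) parameter counts for a simplified MoE.

    Assumption: every expert has the same size, and `shared_params` covers
    everything that is used for every token (attention, embeddings, router).
    """
    if not 0 < experts_per_token <= num_experts:
        raise ValueError("experts_per_token must be between 1 and num_experts")
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active


# 8 experts of 7B each, 2 routed per token, no shared parameters (idealised).
total, active = moe_param_counts(
    num_experts=8, experts_per_token=2, params_per_expert=7_000_000_000
)
print(f"total={total / 1e9:.0f}B active={active / 1e9:.0f}B fraction={active / total:.2f}")
# prints: total=56B active=14B fraction=0.25
```

Passing a non-zero `shared_params` raises both counts by the same amount, which is why the active fraction of a real model is usually higher than the bare experts-per-token ratio.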

Counter-examples

Looks similar, but isn't

Not an instance
A dense transformer with every parameter used for every token.
Not an instance
An ensemble of independently trained models combined by averaging.

Editorial commentary

MoE models decouple total parameter count from per-token compute, so very large models can be run at a manageable per-token inference cost. Shazeer et al. (2017) reintroduced sparse MoE for deep learning; later research models and deployments, both open and closed (GLaM, Mixtral, Gemini 1.5, and reportedly GPT-4), have made MoE central to frontier-model engineering. Reporting practice therefore distinguishes 'total parameters' from 'active parameters per token'.
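How a router keeps per-token compute small can be illustrated with a minimal top-k gating sketch in plain Python. It is a toy under stated assumptions, not the method of any cited paper or deployed model: the names `route_top_k` and `moe_layer` are hypothetical, the experts are simple scaling functions, and the noise terms and load-balancing losses used in practice are omitted.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, router_weights, experts, k=2):
    """Apply a sparse MoE layer to one token vector x.

    router_weights holds one weight vector per expert; experts are callables.
    Only the k selected experts are evaluated; the rest cost nothing here.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    out = [0.0] * len(x)
    used = []
    for idx, gate in route_top_k(logits, k):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
        used.append(idx)
    return out, used


# Toy setup (assumed for illustration): 8 experts, each scaling its input.
random.seed(0)
dim, num_experts = 4, 8
router_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
experts = [lambda v, s=s: [s * vi for vi in v] for s in range(1, num_experts + 1)]

out, used = moe_layer([1.0, 0.5, -0.5, 2.0], router_weights, experts, k=2)
print("experts used:", used)
```

All eight experts exist as parameters, but each token triggers only `k` expert calls, which is the source of the gap between total and active parameter counts.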

References

Shazeer et al., 'Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer' (ICLR 2017); Fedus, Zoph, Shazeer, 'Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity' (JMLR 2022).

Also known as

MoE · sparse mixture-of-experts

Machine-readable encodings

Use in your systems

JATS XML <role> element

xml

<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Mixture-of-experts (MoE)"
      vocab-term-identifier="https://casrai.org/dictionary/term/mixture-of-experts-moe" />

Schema.org DefinedTerm (JSON-LD)

json

{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "@id": "https://casrai.org/dictionary/term/mixture-of-experts-moe",
  "name": "Mixture-of-experts (MoE)",
  "identifier": "https://casrai.org/dictionary/term/mixture-of-experts-moe",
  "description": "A neural-network architecture in which a learned router directs each input (or token) to a small subset of specialist sub-networks ('experts'), so that the model has a large total parameter count but uses only a fraction per forward pass.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-ml-research-outputs#set",
  "url": "https://casrai.org/dictionary/term/mixture-of-experts-moe",
  "alternateName": [
    "MoE",
    "sparse mixture-of-experts"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "publisher": {
    "@id": "https://casrai.org/#organization"
  },
  "dateModified": "2026-05-21T02:22:51",
  "inLanguage": "en"
}
