Examples
Worked examples
- Is an instance
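A model with 8 experts per MoE layer, nominally 7B parameters each, routing each token to 2 experts. Because attention layers and embeddings are shared rather than replicated per expert, the totals are lower than a naive 8 × 7B = 56B: Mixtral 8x7B, for example, has roughly 47B total parameters, of which roughly 13B are active per token.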
- Is an instance
A trillion-parameter MoE deployed with single-digit-billion active parameters per token.
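The arithmetic behind "total" versus "active" parameters in the examples above can be sketched as follows. This is a simplified accounting, not the method of any particular model: the function name `moe_param_counts`, its parameters, and the sizes used are hypothetical, and it assumes equally sized experts plus a block of shared parameters (attention, embeddings, router) that every token uses.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, active) parameter counts for a simplified MoE model.

    Assumption: all experts are the same size, and `shared_params`
    (attention, embeddings, router) are used for every token.
    """
    if not 0 < experts_per_token <= num_experts:
        raise ValueError("experts_per_token must be between 1 and num_experts")
    # Every expert is stored, so all of them count toward the total.
    total = shared_params + num_experts * params_per_expert
    # Only the routed experts run for a given token.
    active = shared_params + experts_per_token * params_per_expert
    return total, active


# Hypothetical sizes, chosen only to illustrate the arithmetic.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=5_000_000_000,
    num_experts=8,
    experts_per_token=2,
)
print(f"total={total / 1e9:.0f}B active={active / 1e9:.0f}B")
# prints: total=42B active=12B
```

The shared term is why the active count is not simply the total scaled by `experts_per_token / num_experts`.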
Counter-examples
Looks similar, but isn't
- Not an instance
A dense transformer with every parameter used for every token.
- Not an instance
An ensemble of independently trained models combined by averaging.
Editorial commentary
MoE models decouple parameter count from per-token compute: only the experts selected by the router run for a given token, so very large models can be served at manageable inference cost. Shazeer et al. (2017) reintroduced sparse MoE for deep learning; recent open and closed deployments (Mixtral, GLaM, Gemini 1.5, and reportedly GPT-4, whose architecture has not been officially confirmed) make MoE central to frontier-model engineering. Reporting practice therefore distinguishes 'total parameters' from 'active parameters per token'.
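A minimal sketch of how routing decouples compute from parameter count, assuming top-k routing with a softmax over the selected router scores. This is an illustration only: the names `top_k_route` and `moe_forward` are hypothetical, the router scores are passed in directly rather than computed by a learned layer, and the noise and load-balancing terms used in published systems are omitted.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits (ties keep the lower index first).
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only; subtract the max for stability.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the k selected experts' outputs; other experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Toy experts: scalar functions standing in for feed-forward sub-networks.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, 0.3, 2.0]  # hypothetical router scores for one token

routes = top_k_route(logits, k=2)
output = moe_forward(1.0, experts, logits, k=2)
print(routes, output)
```

With these scores only experts 1 and 3 are evaluated, so the per-token cost scales with `k` rather than with the number of experts.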
References
- Shazeer et al., 'Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer' (ICLR 2017).
- Fedus, Zoph, Shazeer, 'Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity' (JMLR 2022).
Also known as
MoE · sparse mixture-of-experts
Machine-readable encodings
Use in your systems
<role vocab="credit"
vocab-identifier="https://casrai.org/dictionary/"
vocab-term="Mixture-of-experts (MoE)"
vocab-term-identifier="https://casrai.org/dictionary/term/mixture-of-experts-moe" />
{
"@context": "https://schema.org",
"@type": "DefinedTerm",
"name": "Mixture-of-experts (MoE)",
"identifier": "https://casrai.org/dictionary/term/mixture-of-experts-moe",
"description": "A neural-network architecture in which a learned router directs each input (or token) to a small subset of specialist sub-networks ('experts'), so that the model has a large total parameter count but uses only a fraction per forward pass.",
"inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
"url": "https://casrai.org/dictionary/term/mixture-of-experts-moe",
"alternateName": [
"MoE",
"sparse mixture-of-experts"
],
"license": "https://creativecommons.org/licenses/by/4.0/"
}







