What is a transformer model?
A transformer is a neural-network architecture, introduced in 2017, that uses a self-attention mechanism to weigh the relationships between all elements of a sequence at once, and now underpins most modern large language models.
Self-attention
The defining feature of the transformer is self-attention. For each element of a sequence — for example each token in a sentence — the mechanism computes how relevant every other element is, and forms a representation that blends in the most relevant context. This lets the model capture long-range relationships directly, regardless of how far apart elements are. Crucially, self-attention processes all positions in parallel, which contrasts with earlier recurrent networks that had to read a sequence step by step.
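The mechanism described above can be sketched in a few lines of plain Python. This is a simplified illustration, not a production implementation: it assumes the queries, keys, and values are the input vectors themselves, whereas real transformers apply learned projection matrices and usually several attention heads. The names `self_attention`, `softmax`, and `dot`, and the toy token vectors, are made up for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Scaled dot-product self-attention over a list of vectors.

    Simplifying assumption: each input vector serves as its own query,
    key, and value (no learned projections).
    """
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    # Each query is handled independently of the others, so this loop
    # could run for all positions at the same time.
    for query in vectors:
        # Relevance of every element (including itself) to this query.
        scores = [dot(query, key) / scale for key in vectors]
        weights = softmax(scores)
        # New representation: a relevance-weighted blend of all elements.
        blended = [
            sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(len(query))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "tokens"; the first and third are identical.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy input the first token attends equally to itself and to the identical third token, and less to the second, however far apart the matching tokens sit in the sequence.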
Why transformers scaled
Because self-attention is parallelisable, transformers make far better use of parallel hardware such as GPUs than recurrent models, whose step-by-step computation cannot be spread across the positions of a sequence. This made it practical to train very large models on very large datasets.
Combined with the architecture's ability to model long-range dependencies, this scalability is why transformers displaced recurrent networks for language and underpin the large language models and other generative-AI systems that followed.
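The difference in parallelism can be illustrated with a toy comparison. The sketch below is an assumption-laden simplification: `recurrent_states` is a one-weight scalar recurrence and `attend` is a scalar attention step without learned parameters, both invented for this example. It only shows the dependency structure: each recurrent state needs the previous one, while each attention output depends only on the raw inputs and can be computed in any order.

```python
import math


def recurrent_states(inputs, weight=0.5):
    """Toy recurrence: every state needs the previous state,
    so the steps have to run one after another."""
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(weight * state + x)
        states.append(state)
    return states


def attend(position, inputs):
    """Toy attention for one position over scalar inputs.
    The result depends only on the raw inputs, not on other outputs."""
    scores = [inputs[position] * other for other in inputs]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, inputs))


inputs = [0.2, -1.0, 0.7, 1.5]

# Attention outputs computed left to right...
in_order = [attend(i, inputs) for i in range(len(inputs))]

# ...and in an arbitrary order: the results are identical, which is
# what lets hardware compute every position at the same time.
by_position = {i: attend(i, inputs) for i in [3, 0, 2, 1]}
any_order = [by_position[i] for i in range(len(inputs))]

print(in_order == any_order)
print(recurrent_states(inputs))
```

The recurrent version has no such freedom: the state at step t cannot be computed until the state at step t-1 exists.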
Beyond language
Although introduced for machine translation, the transformer proved to be a general-purpose architecture. Variants now handle images (vision transformers), audio, and protein structures, among other data. Its flexibility comes from treating input as a set of elements with learned relationships, an idea that transfers across deep-learning domains. This generality is why a single architecture underlies systems as varied as text generators, image models, and scientific prediction tools.
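As one illustration of how non-text data becomes a sequence of elements, vision transformers typically cut an image into fixed-size patches and treat each patch as a token. The sketch below shows only that splitting step on a tiny grid of numbers; the function name `image_to_patches` and the 4x4 example image are hypothetical, and real models go on to project each patch into an embedding and add position information.

```python
def image_to_patches(image, patch_size):
    """Split a 2-D grid of pixel values into flattened,
    non-overlapping square patches, read left to right, top to bottom."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch row by row into a single list.
            patch = [
                image[top + r][left + c]
                for r in range(patch_size)
                for c in range(patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 "image" whose pixel values are just 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patches(image, 2)
for patch in patches:
    print(patch)
```

Once the image is a list of patch vectors, the same attention machinery used for tokens in a sentence can be applied to it.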
Transformers in research
The transformer is one of the most influential architectures in modern machine learning, and understanding its behaviour — how attention works, why scale helps, and where it fails — is an active research area. Because transformer-based models are large and often trained on proprietary data, reproducibility and transparency are recurring concerns; documenting the architecture, training data, and settings is essential. As with all deep models, their outputs are empirical and require validation rather than being treated as authoritative.
Key facts
At a glance
- Introduced: Vaswani et al., 2017 ("Attention Is All You Need")
- Type: neural-network architecture for sequences
- Key mechanism: self-attention
- Processing: all sequence positions in parallel, not step by step
- Role: foundation of modern large language models
- Scope: generalises beyond text to images, audio and more
Common questions
FAQ
What is self-attention?
Self-attention is the mechanism by which a transformer weighs how relevant every element of a sequence is to every other, building each element's representation from the most relevant context. It captures long-range relationships and processes all positions in parallel.
Why did transformers replace recurrent neural networks?
Recurrent networks read sequences step by step, which limits parallelism and makes long-range dependencies hard to learn. Transformers process all positions at once via self-attention, training more efficiently at scale and modelling long-range relationships better.
Are transformers only used for language?
No. Although introduced for translation, transformers are now applied to images, audio, protein structures, and more. The architecture is general-purpose, which is why it underlies a wide range of modern AI systems beyond text.