What is a large language model?

A large language model (LLM) is a neural network trained on very large amounts of text to predict the next unit of language, called a token. This training allows it to generate fluent text and perform a wide range of language tasks.

How a large language model works

An LLM is trained on a self-supervised objective: given a stretch of text, predict the next token (a word or word fragment). Repeated over trillions of tokens, this teaches the model statistical regularities of language. The dominant architecture is the transformer, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., whose self-attention mechanism lets the model weigh relationships between all positions in a sequence. At generation time the model produces one token at a time, each conditioned on the text so far.
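The next-token objective and one-token-at-a-time generation can be illustrated with a deliberately tiny stand-in. The sketch below is not a transformer: it is a bigram counter that only looks at the single previous token, and the corpus, function names, and greedy selection rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each token, how often each other token follows it."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return counts

def next_token_probs(counts, token):
    """Turn the follow-up counts for `token` into a probability distribution."""
    followers = counts.get(token, Counter())
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}

def generate(counts, start, max_new_tokens):
    """Greedy autoregressive generation: always append the most frequent follower."""
    out = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # no observed continuation for this token
        out.append(followers.most_common(1)[0][0])
    return out

# Hypothetical toy corpus; real models train on trillions of tokens.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

print(next_token_probs(model, "the"))  # "cat" follows "the" twice, "mat" once
print(generate(model, "the", 4))       # ['the', 'cat', 'sat', 'on', 'the']
```

A real LLM replaces the count table with a neural network that conditions on the whole preceding text, but the loop in `generate` has the same shape: predict, append, repeat.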

Capabilities

Because predicting text well requires modelling a great deal of structure, LLMs can perform many language tasks without task-specific training, including summarising, translating, drafting, classifying, and answering questions. They are often guided only by instructions in the input (the prompt); the craft of writing those instructions is known as prompt engineering.

Models are typically pre-trained on broad text and then fine-tuned or aligned (for example with human feedback) to follow instructions more reliably. This combination of scale and adaptability is what distinguishes LLMs from earlier task-specific systems.

Limitations and hallucination

An LLM predicts plausible text; it has no built-in notion of truth. It can therefore produce hallucinations — fluent statements that are factually wrong or fabricated, including invented citations. LLMs may also reflect biases in their training data, be sensitive to how a prompt is phrased, and lack knowledge of events after their training cut-off. These properties mean LLM output should be verified rather than trusted, especially for factual claims.
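One narrow form of verification, checking cited identifiers, can be sketched as follows. This is a minimal illustration under stated assumptions: the DOIs, the author names in the sample text, and the trusted list are made-up placeholders, and the regular expression is a common approximation of DOI syntax rather than a complete validator. A DOI that passes this check still says nothing about whether the cited work supports the claim.

```python
import re

# Approximate DOI shape: "10.", a 4-9 digit prefix, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:a-z0-9]+", re.IGNORECASE)

def extract_dois(text):
    """Find DOI-like strings and trim trailing sentence punctuation."""
    return [m.rstrip(".,;)").lower() for m in DOI_PATTERN.findall(text)]

def unverified_dois(llm_output, trusted_dois):
    """Return DOIs cited in the output that are absent from the trusted list."""
    trusted = {d.lower() for d in trusted_dois}
    return [d for d in extract_dois(llm_output) if d not in trusted]

# Placeholder data for illustration only; none of these identifiers are real.
trusted = ["10.1234/example.one"]
llm_output = "See Smith (10.1234/example.one) and Jones (10.5555/made.up)."

print(unverified_dois(llm_output, trusted))  # ['10.5555/made.up']
```

Anything the function returns is a candidate for manual checking against an authoritative source, not proof of fabrication.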

LLMs in research

In research, LLMs are studied as objects (their capabilities, biases, and reasoning) and used as tools (for literature triage, drafting, or data extraction). Methodologically, reproducibility is difficult because outputs are stochastic and models change between versions; sound practice records the exact model, version, and settings, and validates outputs against authoritative sources. Using LLM output as evidence without verification is regarded as a methodological error.
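Recording the exact model, version, and settings can be as simple as writing a small structured record next to each output. The sketch below assumes hypothetical field names and values (`example-model`, a date-style version string, a temperature and a seed); which settings actually exist depends on the model or service being used.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RunRecord:
    """Metadata needed to describe one LLM call (hypothetical fields)."""
    model: str
    version: str
    temperature: float
    seed: int
    prompt_sha256: str
    output_sha256: str

def sha256_hex(text):
    """Fingerprint a prompt or output so later changes are detectable."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def make_record(model, version, temperature, seed, prompt, output):
    return RunRecord(model, version, temperature, seed,
                     sha256_hex(prompt), sha256_hex(output))

# Placeholder values; substitute whatever was really used.
record = make_record(
    model="example-model",
    version="2024-01-01",
    temperature=0.0,
    seed=42,
    prompt="Summarise the abstract.",
    output="A short summary.",
)
print(json.dumps(asdict(record), indent=2, sort_keys=True))
```

Storing hashes rather than full text keeps the record compact; where the prompts and outputs themselves are part of the evidence, they would need to be archived as well.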

Key facts

At a glance

Definition: neural model trained to predict the next token
Architecture: transformer (Vaswani et al., 2017)
Key mechanism: self-attention
Training: self-supervised on very large text corpora
Adaptation: pre-training then fine-tuning / alignment
Key limitation: hallucination (confident, incorrect output)

Common questions

FAQ

How does a large language model generate text?

It predicts text one token at a time, each token chosen based on the preceding text using probabilities learned during training. Stringing these predictions together produces fluent passages.
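How a token is chosen from learned probabilities can be sketched as below. The candidate tokens and their scores are invented for illustration, and the temperature setting is an assumption: it is a common way of controlling randomness, but it is not named in this article. Fixing the random seed makes this toy draw repeatable.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, temperature, rng):
    """Draw one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical candidates and scores for the text "the cat sat on the".
vocab = ["mat", "sofa", "moon"]
logits = [2.0, 1.0, -1.0]

rng = random.Random(0)  # fixed seed so the draw can be repeated
print(softmax(logits, temperature=1.0))
print(softmax(logits, temperature=0.5))  # more weight on the top candidate
print(sample_token(vocab, logits, temperature=1.0, rng=rng))
```

Because the choice is a weighted random draw, the same prompt can yield different continuations from run to run unless the randomness is controlled.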

What is hallucination in a large language model?

Hallucination is when an LLM produces fluent but factually wrong or fabricated content — including invented facts or citations. It happens because the model predicts plausible text rather than verifying truth, so outputs should always be checked.

What architecture do large language models use?

Most LLMs are based on the transformer architecture introduced in 2017, which uses a self-attention mechanism to weigh relationships between all positions in a sequence.
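The core of self-attention, scaled dot-product attention, can be sketched in plain Python. This is a simplified illustration, not a full transformer layer: a real model derives queries, keys, and values from learned projections of the token embeddings, whereas here the made-up input vectors are used directly for all three. The optional `causal` flag is an added assumption that shows the masking generative models typically apply so each position only looks at itself and earlier positions.

```python
import math

def softmax(xs):
    """Normalise scores into weights that sum to 1."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values, causal=False):
    """Return (outputs, weights) for scaled dot-product attention."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Similarity of this position's query to every position's key.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        if causal:
            # Hide later positions by giving them zero weight after softmax.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three positions with 2-dimensional vectors (illustrative numbers only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
_, causal_weights = self_attention(x, x, x, causal=True)

print(weights[2])         # how position 2 weighs positions 0, 1 and 2
print(causal_weights[0])  # [1.0, 0.0, 0.0]: position 0 sees only itself
```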
