  • Large Language Models in Research: An Explainer

    A large language model (LLM) is a type of artificial-intelligence model, built on the transformer neural-network architecture, that is trained on very large quantities of text to predict and generate language. At its core, an LLM learns the statistical patterns of language by repeatedly predicting the next token in a sequence; after training on enough text, this simple objective yields a system that can answer questions, summarise, translate and draft prose. Understanding how LLMs work — and where they fail — is now essential for researchers who use or evaluate them.

    Transformers and tokens

    The transformer, introduced in 2017, is the architecture underlying modern LLMs. It is built around self-attention, a mechanism that lets the model weigh the relevance of every other part of the input when processing each element. Because attention relates all positions to one another directly, rather than stepping through the text one element at a time, it captures long-range relationships and can be computed in parallel. This made it practical to train far larger models than earlier sequence architectures, such as recurrent networks, allowed.
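
    The weighing described above can be illustrated with a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the three input vectors are made-up numbers, and the same vectors serve as queries, keys and values, whereas a real transformer first passes its inputs through learned projections and runs several attention heads side by side.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every input position to this query position.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy "token" vectors (illustrative numbers, not learned embeddings).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)

for position, row in enumerate(weights):
    print(position, [round(w, 3) for w in row])
```

    Each printed row shows how strongly one position attends to every position, itself included. No row depends on another, which is why the computation parallelises well.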

    LLMs do not read words directly. Text is broken into tokens (units that may be whole words, parts of words or punctuation), and each token is converted into a numerical vector. The model processes sequences of these tokens and predicts the next one by assigning a probability to every entry in its vocabulary. Generation proceeds token by token, with each new token appended to the input before the next prediction. A model also has a finite context window, the maximum number of tokens it can consider at once, which matters when working with long documents.
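
    The loop described above can be sketched in a few lines of Python. Everything here is a toy assumption: the tokenizer splits only on words and punctuation (production tokenizers usually learn subword units), the four-entry vocabulary is invented, and `toy_score` is a hypothetical stand-in for a trained network. What the sketch does show faithfully is the shape of the process: score every vocabulary entry, convert the scores to probabilities, append the chosen token, and trim the input to the context window.

```python
import math
import re

def tokenize(text):
    """Toy tokenizer: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(context, vocab, score_fn):
    """Assign a probability to every vocabulary entry given the context."""
    logits = [score_fn(context, candidate) for candidate in vocab]
    return dict(zip(vocab, softmax(logits)))

def generate(prompt_tokens, vocab, score_fn, max_new_tokens, context_window):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        context = tokens[-context_window:]  # older tokens fall out of view
        probs = next_token_distribution(context, vocab, score_fn)
        tokens.append(max(probs, key=probs.get))  # greedy: most likely token
    return tokens

# Hypothetical vocabulary and scoring rule standing in for a trained model.
VOCAB = ["the", "cat", "sat", "."]

def toy_score(context, candidate):
    follows = {"the": "cat", "cat": "sat", "sat": ".", ".": "the"}
    return 2.0 if follows.get(context[-1]) == candidate else 0.0

prompt = tokenize("The cat")
result = generate(prompt, VOCAB, toy_score, max_new_tokens=3, context_window=2)
print(result)  # ['the', 'cat', 'sat', '.', 'the']
```

    Greedy selection is only one decoding choice; deployed systems often sample from the distribution instead, which is one reason the same prompt can yield different outputs.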

    Pretraining and fine-tuning

    LLMs are typically built in two stages. Pretraining exposes the model to a vast, broad corpus, from which it learns general language patterns through next-token prediction; this is the costly, compute-intensive stage. Fine-tuning then adapts the pretrained model to specific tasks or behaviours using smaller, targeted datasets. A widely used alignment step, reinforcement learning from human feedback (RLHF), further tunes models on human preference judgements so that their responses are more helpful and follow instructions. This two-stage design is why a single pretrained base can be specialised for many downstream uses, and it places LLMs within the broader story of neural networks and deep learning.
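
    As a loose analogy for the two stages, the sketch below trains a toy next-token predictor that simply counts which word follows which. This is an assumption made for illustration only: real LLMs adjust neural-network weights by gradient descent rather than keeping counts, and the class name and both corpora are invented. The point it illustrates is that a small, targeted dataset can shift the behaviour of a model that was first trained on broader text.

```python
from collections import Counter, defaultdict

class BigramModel:
    """Toy next-token predictor that counts which token follows which."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, corpus):
        for sentence in corpus:
            tokens = sentence.lower().split()
            for prev, nxt in zip(tokens, tokens[1:]):
                self.counts[prev][nxt] += 1

    def predict(self, prev):
        """Return the most frequently observed token after `prev`."""
        return self.counts[prev].most_common(1)[0][0]

# "Pretraining": broad, general-purpose text (invented examples).
broad_corpus = [
    "the bank is open",
    "the bank is open",
    "the bank is closed",
    "the river is wide",
    "the model is large",
]

# "Fine-tuning": a smaller, domain-specific set (invented examples).
domain_corpus = [
    "the bank is eroding",
    "the bank is eroding",
    "the bank is eroding",
]

model = BigramModel()
model.train(broad_corpus)
before = model.predict("is")

model.train(domain_corpus)  # continue training the same model
after = model.predict("is")

print(before, "->", after)  # open -> eroding
```

    The same object is trained twice rather than rebuilt, which mirrors how one pretrained base can be specialised in different directions by different fine-tuning sets.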

    Capabilities and limitations

    LLMs are capable assistants for drafting, summarising, translating, extracting information and explaining concepts. But their limitations are intrinsic, not incidental, and researchers must keep them in view.

    Capability                           | Corresponding limitation
    -------------------------------------+-----------------------------------------------------------------
    Fluent, plausible text generation    | Hallucination: confident but false statements
    Broad knowledge from training data   | Knowledge cut-off; no awareness of newer events
    Summarising and synthesising sources | Weak provenance: cannot reliably cite where claims came from
    Following instructions               | Sensitivity to phrasing; potential to reflect training-data bias

    The most important limitation for scholarship is hallucination: because an LLM generates statistically likely text rather than retrieving verified facts, it can produce fabricated references, false figures and incorrect claims stated with full confidence. It also lacks reliable provenance — it cannot, by default, tell you which source a statement came from. Outputs must therefore be independently verified, not trusted at face value.
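
    One small part of that verification, checking whether cited identifiers exist at all, can be sketched as follows. The function names are hypothetical, the identifiers are deliberately fake placeholders, and `trusted_index` stands in for whatever authoritative source a team actually uses, such as a library catalogue export or a curated reference manager.

```python
def normalise(identifier):
    """Compare identifiers without regard to case or stray whitespace."""
    return identifier.strip().lower()

def check_references(cited, trusted_index):
    """Split model-cited identifiers into verified and unverified lists."""
    trusted = {normalise(ref) for ref in trusted_index}
    verified, unverified = [], []
    for ref in cited:
        if normalise(ref) in trusted:
            verified.append(ref)
        else:
            unverified.append(ref)
    return verified, unverified

# Placeholder identifiers for illustration only; none is a real DOI.
trusted_index = ["doi:10.0000/example.001", "doi:10.0000/example.002"]
cited_by_model = ["DOI:10.0000/example.001", "doi:10.0000/example.999"]

verified, unverified = check_references(cited_by_model, trusted_index)
print("verified:", verified)
print("needs manual checking:", unverified)
```

    A match only shows that a reference exists. Whether it actually supports the claim attached to it still has to be checked by reading the source.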

    Responsible use and disclosure in research

    Used responsibly, LLMs can accelerate literature triage, drafting and coding. Used uncritically, they can introduce errors and fabricated citations and raise questions about undisclosed AI-assisted authorship. Many journals and funders now require disclosure of generative-AI use in manuscripts, and most editorial policies hold that an LLM cannot be an author because it cannot take responsibility for the work. Good practice is to verify every factual claim and reference, keep a record of how the tool was used, and report that use transparently. Outputs produced or assisted by LLMs should be treated as research outputs: subject to the same scrutiny and documentation as any other, and described with consistent terminology. Our guidance for authors covers disclosure and documentation expectations, and handling model outputs reliably depends on sound data infrastructure and metadata practice.
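
    Keeping a record of tool use can be as simple as writing one structured entry per task. The sketch below shows one possible shape for such an entry; the class name, field names and values are all hypothetical and do not correspond to any journal's or funder's required schema.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class LLMUsageRecord:
    """One log entry describing a single use of an LLM (hypothetical schema)."""
    tool: str          # name of the model or service used
    version: str       # model version or release, if known
    date: str          # ISO 8601 date of use
    purpose: str       # what the tool was asked to do
    verified_by: str   # who checked the output
    notes: list = field(default_factory=list)

record = LLMUsageRecord(
    tool="example-llm",  # placeholder, not a real product
    version="unknown",
    date="2024-01-01",
    purpose="First-draft summary of the methods section",
    verified_by="Corresponding author",
    notes=["All references checked against the original papers"],
)

# One JSON object per line is easy to append to a log file and to audit later.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```

    Entries of this kind give authors something concrete to draw on when writing the disclosure statement a journal asks for.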

    Frequently asked questions

    What is a token in a large language model?

    A token is the unit of text an LLM processes — a whole word, part of a word, or punctuation. Text is split into tokens and converted to numerical vectors; the model predicts the next token in sequence. A model’s context window limits how many tokens it can consider at once.

    What is the difference between pretraining and fine-tuning?

    Pretraining teaches a model general language patterns from a vast, broad corpus and is computationally expensive. Fine-tuning then adapts that pretrained model to specific tasks or behaviours using smaller, targeted datasets, so one base model can be specialised for many uses.

    Why do large language models hallucinate?

    Because they generate statistically likely text rather than retrieving verified facts. An LLM predicts plausible continuations, so it can state fabricated references or false figures with full confidence. Outputs must be independently verified, since the model has no built-in mechanism guaranteeing factual accuracy.

    Should I disclose using an LLM in my research?

    Yes. Many journals and funders require disclosure of generative-AI use, and most hold that an LLM cannot be a named author. Verify all claims and references, record how the tool was used, and report that use transparently in line with relevant editorial policy.