Tag: transformers

  • Large Language Models in Research: An Explainer

    A large language model (LLM) is a type of artificial-intelligence model, built on the transformer neural-network architecture, that is trained on very large quantities of text to predict and generate language. At its core, an LLM learns the statistical patterns of language by repeatedly predicting the next token in a sequence; after training on enough text, this simple objective yields a system that can answer questions, summarise, translate and draft prose. Understanding how LLMs work — and where they fail — is now essential for researchers who use or evaluate them.

    Transformers and tokens

    The transformer, introduced in 2017, is the architecture underlying modern LLMs. Its key innovation is to build the whole model around self-attention, a mechanism that lets the model weigh the relevance of different parts of the input when processing each element. Because attention relates all positions directly rather than stepping through the sequence one token at a time, it captures long-range relationships in text and can be computed in parallel. This made it practical to train far larger models than earlier sequence architectures allowed.
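
    The weighting step described above can be sketched in a few lines. This is a minimal illustration of scaled dot-product attention, assuming hand-written toy vectors and omitting the learned query, key and value projections that a real transformer applies first; the function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this position against every position in the input.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three token vectors (illustrative values, not learned).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
```

    Each row of `weights` shows how much one position draws on every other position. Every row is computed independently of the others, which is what allows the work to run in parallel.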

    LLMs do not read words directly. Text is broken into tokens — units that may be whole words, parts of words or punctuation — and each token is converted into a numerical vector. The model processes sequences of these tokens and predicts the next one, assigning probabilities across its vocabulary. Generation proceeds token by token. Because models have a finite context window, the amount of text they can consider at once is bounded, which matters when working with long documents.
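
    As a rough sketch of the two ideas in this paragraph, the snippet below turns a set of raw scores into next-token probabilities and trims a token sequence to a fixed context window. The scores, the three-word vocabulary and the helper names are invented for illustration, and dropping the oldest tokens is only one possible way of handling text that exceeds the window.

```python
import math

def next_token_probabilities(logits):
    """Softmax over a dict of {token: raw score} for a toy vocabulary."""
    m = max(logits.values())
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def fit_to_context(tokens, context_window):
    """Keep only the most recent tokens that fit inside the window."""
    if context_window <= 0:
        return []
    return tokens[-context_window:]

# Hypothetical scores a model might assign after "the cat sat on the".
logits = {"mat": 3.0, "sofa": 2.0, "moon": 0.5}
probs = next_token_probabilities(logits)
best = max(probs, key=probs.get)  # greedy choice: the most probable token

# Whitespace splitting stands in for a real tokeniser here.
tokens = "the cat sat on the".split()
visible = fit_to_context(tokens, context_window=3)
```

    Generation repeats this loop: choose a token from the distribution, append it to the sequence, and predict again from whatever still fits in the window.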

    Pretraining and fine-tuning

    LLMs are typically built in two stages. Pretraining exposes the model to a vast, broad corpus, from which it learns general language patterns through next-token prediction; this is the costly, compute-intensive stage. Fine-tuning then adapts the pretrained model to specific tasks or behaviours using smaller, targeted datasets. A widely used alignment step, reinforcement learning from human feedback (RLHF), further tunes models so that their responses are more helpful and follow instructions. This two-stage design is why a single pretrained base can be specialised for many downstream uses, and it ties LLMs to the broader story of neural networks and deep learning.

    Capabilities and limitations

    LLMs are capable assistants for drafting, summarising, translating, extracting information and explaining concepts. But their limitations are intrinsic, not incidental, and researchers must keep them in view.

    Capability | Corresponding limitation
    ---------- | ------------------------
    Fluent, plausible text generation | Hallucination: confident but false statements
    Broad knowledge from training data | Knowledge cut-off; no awareness of newer events
    Summarising and synthesising sources | Weak provenance: cannot reliably cite where claims came from
    Following instructions | Sensitivity to phrasing; potential to reflect training-data bias

    The most important limitation for scholarship is hallucination: because an LLM generates statistically likely text rather than retrieving verified facts, it can produce fabricated references, false figures and incorrect claims stated with full confidence. It also lacks reliable provenance — it cannot, by default, tell you which source a statement came from. Outputs must therefore be independently verified, not trusted at face value.

    Responsible use and disclosure in research

    Used responsibly, LLMs can accelerate literature triage, drafting and coding. Used uncritically, they introduce errors and fabricated citations and raise questions about undisclosed authorship. Many journals and funders now require disclosure of generative-AI use in manuscripts, and most editorial policies hold that an LLM cannot be an author because it cannot take responsibility for the work. Good practice is to verify every factual claim and reference, keep a record of how the tool was used, and report that use transparently. Outputs produced or assisted by LLMs should be treated as research outputs: described with consistent terminology and subject to the same scrutiny and documentation as any other. Our guidance for authors covers disclosure and documentation expectations, and reliable handling of model outputs intersects with sound data infrastructure and metadata practice.

    Frequently asked questions

    What is a token in a large language model?

    A token is the unit of text an LLM processes — a whole word, part of a word, or punctuation. Text is split into tokens and converted to numerical vectors; the model predicts the next token in sequence. A model’s context window limits how many tokens it can consider at once.

    What is the difference between pretraining and fine-tuning?

    Pretraining teaches a model general language patterns from a vast, broad corpus and is computationally expensive. Fine-tuning then adapts that pretrained model to specific tasks or behaviours using smaller, targeted datasets, so one base model can be specialised for many uses.

    Why do large language models hallucinate?

    Because they generate statistically likely text rather than retrieving verified facts. An LLM predicts plausible continuations, so it can state fabricated references or false figures with full confidence. Outputs must be independently verified, since the model has no built-in mechanism guaranteeing factual accuracy.

    Should I disclose using an LLM in my research?

    Yes. Many journals and funders require disclosure of generative-AI use, and most hold that an LLM cannot be a named author. Verify all claims and references, record how the tool was used, and report that use transparently in line with relevant editorial policy.

  • Natural Language Processing (NLP) in Research: A Plain Guide

    Natural language processing (NLP) is the field of artificial intelligence concerned with making human language machine-processable, so computers can read, interpret, generate and respond to text and speech. It combines linguistics, statistics and machine learning to turn unstructured language into structured signals a model can work with. NLP now underpins search engines, translation tools, literature-screening systems and the large language models behind modern research assistants.

    From raw text to numbers

    Computers operate on numbers, not words, so the first job of any NLP pipeline is to convert language into a numerical form. Two steps dominate this process.

    Tokenisation splits text into smaller units called tokens, which may be words, sub-words or characters. Modern systems favour sub-word tokenisation because it handles rare words and morphology gracefully without an unmanageably large vocabulary.
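
    To make sub-word splitting concrete, here is a minimal sketch that uses greedy longest-match against a small, hand-picked vocabulary. It is an assumption-laden toy: production tokenisers learn their vocabulary from data and use more careful algorithms, and the vocabulary and function name below are hypothetical.

```python
def subword_tokenise(word, vocab):
    """Greedy longest-match split; unknown characters fall back to single characters."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical sub-word vocabulary, chosen by hand for this example.
vocab = {"token", "is", "ation", "un", "manage", "able"}

pieces_a = subword_tokenise("tokenisation", vocab)
pieces_b = subword_tokenise("unmanageable", vocab)
pieces_c = subword_tokenise("xyz", vocab)  # no known pieces at all
```

    A rare word is rebuilt from pieces the vocabulary already contains, so the vocabulary does not need a separate entry for every word form.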

    Embeddings then map each token to a dense vector of numbers, positioning words with similar meanings near one another in a high-dimensional space. Because embeddings capture semantic relationships learned from large text corpora, “clinician” and “physician” sit close together while “clinician” and “granite” do not. This numerical representation is what downstream models actually learn from. The reliance on learned representations connects NLP to the wider field of machine learning, which we introduce in “What is machine learning”.
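
    Closeness between embeddings is commonly measured with cosine similarity. The sketch below applies it to the example words from this section, assuming hand-written three-dimensional vectors chosen purely for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors only; these values are invented, not learned.
embeddings = {
    "clinician": [0.90, 0.80, 0.10],
    "physician": [0.85, 0.75, 0.15],
    "granite":   [0.10, 0.05, 0.95],
}

close = cosine_similarity(embeddings["clinician"], embeddings["physician"])
far = cosine_similarity(embeddings["clinician"], embeddings["granite"])
```

    With these toy values, `close` comes out much higher than `far`, mirroring the way related terms cluster together in a learned embedding space.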

    Transformers: the architecture that changed NLP

    The transformer, introduced in 2017, is the architecture behind most current NLP systems. Its key innovation is to rely on self-attention, a mechanism that lets the model weigh the relevance of every word to every other word in a sequence, regardless of distance. This captures long-range context that earlier sequential models struggled with, and it parallelises well, enabling training on vast corpora. Large language models are transformers scaled to billions of parameters and trained on enormous text collections.

    Common NLP tasks

    NLP is best understood through the tasks it performs. The table below lists those most relevant to research.

    Task | What it does | Research example
    ---- | ------------ | ----------------
    Text classification | Assigns a category to a document | Screening abstracts for a systematic review
    Named entity recognition | Identifies entities such as genes, drugs or places | Extracting chemical names from papers
    Machine translation | Converts text between languages | Reading non-English literature
    Summarisation | Condenses long text into key points | Digesting large document collections
    Question answering | Returns answers from a body of text | Querying a corpus of protocols

    How researchers use NLP

    Across disciplines, NLP accelerates work that would be impractical by hand. Systematic reviewers use classification to triage thousands of abstracts. Biomedical teams use named entity recognition to mine entities from the literature at scale. Social scientists apply topic modelling and sentiment analysis to large text archives. Curators and metadata specialists increasingly use NLP to normalise terminology against controlled vocabularies such as the CASRAI dictionary, improving the consistency of research records.

    Caveats and reproducibility concerns

    NLP systems inherit the limitations of their training data. Models can encode and amplify bias present in source corpora; they can produce fluent but factually wrong output, often called hallucination; and their behaviour can shift when an underlying model is updated. For research use, these issues raise real reproducibility questions: a result obtained from one model version may not replicate on the next. Documenting the exact model, version, prompt and preprocessing is therefore essential, a theme we explore in our coverage of reproducibility of machine learning research and our broader AI and ML research outputs hub. Treating NLP as a tool whose outputs require human verification, not an oracle, keeps it trustworthy.
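
    One lightweight way to document a run is to write a small structured record alongside each output. The sketch below is an assumption about what such a record might contain, not a prescribed schema: the class name, field names and example values are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ModelRunRecord:
    """Hypothetical record of the details needed to rerun an NLP step."""
    model_name: str
    model_version: str
    prompt: str
    preprocessing: list
    prompt_sha256: str = field(init=False)

    def __post_init__(self):
        # A hash makes silent edits to the prompt easy to detect later.
        self.prompt_sha256 = hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)

# Placeholder values; substitute the real model identifier and settings.
record = ModelRunRecord(
    model_name="example-model",
    model_version="2024-01-15",
    prompt="Summarise the abstract in two sentences.",
    preprocessing=["strip HTML", "lowercase"],
)
record_json = record.to_json()
```

    Stored next to the results, a record like this lets a later reader see which model version and prompt produced a given output, and check whether a rerun used the same inputs.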

    Frequently asked questions

    What is the difference between NLP and machine learning?

    Machine learning is the general study of systems that learn patterns from data. NLP is the application of those techniques, among others, specifically to human language. Most modern NLP is built on machine learning, but they are not the same thing.

    What are embeddings in simple terms?

    Embeddings are lists of numbers that represent the meaning of a word or piece of text, arranged so that similar meanings have similar numbers. They let a model treat “begin” and “start” as related while keeping unrelated words apart.

    Why are transformers so important in NLP?

    Transformers use an attention mechanism to weigh the relevance of all words in a sequence at once, capturing long-range context and training efficiently at scale. They are the foundation of nearly all current large language models.

    Can I trust NLP output in research?

    Only with verification. NLP models can be biased, can fabricate plausible-sounding content, and can change between versions. Record the model, version and settings, and check outputs against authoritative sources, as set out in our guidance for authors.