v2026.1714 entries · CC-BY 4.0
CASRAI


Natural Language Processing (NLP) in Research: A Plain Guide

Natural language processing makes human language machine-processable, from tokenisation and embeddings to transformer models. This guide explains the core building blocks, common tasks such as classification and translation, and what researchers should watch for when using NLP.

By CASRAI Editorial Board
Published 18 Jun 2026 · 4 minute read

Natural language processing (NLP) is the field of artificial intelligence concerned with making human language machine-processable, so computers can read, interpret, generate and respond to text and speech. It combines linguistics, statistics and machine learning to turn unstructured language into structured signals a model can work with. NLP now underpins search engines, translation tools, literature-screening systems and the large language models behind modern research assistants.

From raw text to numbers

Computers operate on numbers, not words, so the first job of any NLP pipeline is to convert language into a numerical form. Two steps dominate this process.

Tokenisation splits text into smaller units called tokens, which may be words, sub-words or characters. Modern systems favour sub-word tokenisation because it handles rare words and morphology gracefully without an unmanageably large vocabulary.
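To make sub-word tokenisation concrete, here is a minimal sketch of a greedy longest-match splitter. It is an illustration only: the toy vocabulary, the function names and the "##" continuation marker (a convention borrowed from WordPiece-style tokenisers) are assumptions for this example, and production tokenisers learn their vocabularies from data.

```python
# Toy vocabulary, invented for illustration. "##" marks a piece that
# continues a word rather than starting one.
TOY_VOCAB = {"token", "##isation", "##s", "un", "##manage", "##able", "rare"}


def subword_tokenise(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one word into sub-word pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenise(text, vocab=TOY_VOCAB):
    """Lower-case, split on whitespace, then split each word into pieces."""
    return [p for w in text.lower().split() for p in subword_tokenise(w, vocab)]


print(tokenise("Rare tokens"))
print(subword_tokenise("unmanageable"))
```

Even with seven vocabulary entries, the splitter can represent words it has never seen whole, which is the property that keeps real vocabularies manageable.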

Embeddings then map each token to a dense vector of numbers, positioning words with similar meanings near one another in a high-dimensional space. Because embeddings capture semantic relationships learned from large text corpora, “clinician” and “physician” sit close together while “clinician” and “granite” do not. This numerical representation is what downstream models actually learn from. The reliance on learned representations connects NLP to the wider field of machine learning, which we introduce in our guide “What is machine learning”.
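The idea of "closeness" between embeddings is usually measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; the numbers are invented, and real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

# Invented toy vectors, NOT real learned embeddings.
TOY_EMBEDDINGS = {
    "clinician": [0.90, 0.80, 0.10],
    "physician": [0.85, 0.75, 0.15],
    "granite": [0.05, 0.10, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


close = cosine_similarity(TOY_EMBEDDINGS["clinician"], TOY_EMBEDDINGS["physician"])
far = cosine_similarity(TOY_EMBEDDINGS["clinician"], TOY_EMBEDDINGS["granite"])
print(f"clinician vs physician: {close:.3f}")
print(f"clinician vs granite:   {far:.3f}")
```

With these toy values the first pair scores close to 1 and the second far lower, mirroring the "clinician" and "granite" example above.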

Transformers: the architecture that changed NLP

The transformer, introduced in 2017, is the architecture behind most current NLP systems. Its core component is the self-attention mechanism, which lets the model weigh the relevance of every token to every other token in a sequence, regardless of how far apart they are. This captures long-range context that earlier sequential models, such as recurrent neural networks, struggled with, and it parallelises well, enabling training on vast corpora. Large language models are transformers scaled to billions of parameters and trained on enormous text collections.
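The attention computation itself is compact. The following is a minimal pure-Python sketch of scaled dot-product attention, assuming the queries, keys and values are all just the input vectors. A real transformer first applies learned projection matrices and uses several attention heads, both of which are omitted here; the input numbers are invented.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        # The output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        )
    return outputs, weights


# Three toy token vectors, invented for illustration.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how much one token attends to every token in the sequence, including itself. Because every row is computed independently, the work parallelises naturally.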

Common NLP tasks

NLP is best understood through the tasks it performs. The table below lists those most relevant to research.

| Task | What it does | Research example |
| --- | --- | --- |
| Text classification | Assigns a category to a document | Screening abstracts for a systematic review |
| Named entity recognition | Identifies entities such as genes, drugs or places | Extracting chemical names from papers |
| Machine translation | Converts text between languages | Reading non-English literature |
| Summarisation | Condenses long text into key points | Digesting large document collections |
| Question answering | Returns answers from a body of text | Querying a corpus of protocols |

How researchers use NLP

Across disciplines, NLP accelerates work that would be impractical by hand. Systematic reviewers use classification to triage thousands of abstracts. Biomedical teams use named entity recognition to mine entities from the literature at scale. Social scientists apply topic modelling and sentiment analysis to large text archives. Curators and metadata specialists increasingly use NLP to normalise terminology against controlled vocabularies such as the CASRAI dictionary, improving the consistency of research records.
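As an illustration of the abstract triage mentioned above, here is a minimal sketch of a naive Bayes text classifier with add-one smoothing. It is a swapped-in teaching technique, not the method any particular screening tool uses; the class name, labels and four training snippets are invented, and a real screening model would need far more labelled data and proper evaluation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesScreener:
    """Hypothetical multinomial naive Bayes classifier for short texts."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocab.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the log likelihood of each word under this label.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w]  # 0 if the word is unseen
                # Add-one smoothing avoids log(0) for unseen words.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training snippets, for illustration only.
train_texts = [
    "randomised controlled trial of statin therapy",
    "trial of insulin dosing in adults",
    "survey of museum visitor attitudes",
    "history of medieval trade routes",
]
train_labels = ["include", "include", "exclude", "exclude"]

screener = NaiveBayesScreener().fit(train_texts, train_labels)
print(screener.predict("controlled trial of aspirin therapy"))
print(screener.predict("survey of visitor attitudes"))
```

In practice a classifier like this would rank abstracts for human review rather than make final inclusion decisions, in keeping with the verification caveats discussed below.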

Caveats and reproducibility concerns

NLP systems inherit the limitations of their training data. Models can encode and amplify bias present in source corpora; they can produce fluent but factually wrong output, often called hallucination; and their behaviour can shift when an underlying model is updated. For research use, these issues raise real reproducibility questions: a result obtained from one model version may not replicate on the next. Documenting the exact model, version, prompt and preprocessing is therefore essential, a theme we explore in our coverage of reproducibility of machine learning research and our broader AI and ML research outputs hub. Treating NLP as a tool whose outputs require human verification, not an oracle, keeps it trustworthy.
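One lightweight way to document a run is to store its settings as a structured record alongside the results. The sketch below is an assumption about what such a record might contain, not a CASRAI or community standard; every field name and value is hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class NlpRunRecord:
    """Hypothetical record of the settings behind one NLP result."""

    model_name: str
    model_version: str
    prompt: str
    preprocessing: list
    settings: dict = field(default_factory=dict)

    def to_json(self):
        # Sorted keys make the serialised form stable across runs.
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    def fingerprint(self):
        # A short hash makes silent changes to any field easy to spot.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()[:16]


# Placeholder values, invented for illustration.
record = NlpRunRecord(
    model_name="example-model",
    model_version="1.0.0",
    prompt="Classify this abstract as include or exclude.",
    preprocessing=["lowercase", "strip_html"],
    settings={"temperature": 0.0},
)
print(record.to_json())
print(record.fingerprint())
```

Saving the JSON next to the outputs, and quoting the fingerprint in a methods section, gives readers a concrete way to check whether a later run used the same configuration.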

Frequently asked questions

What is the difference between NLP and machine learning?

Machine learning is the general study of systems that learn patterns from data. NLP is the application of those techniques, among others, specifically to human language. Most modern NLP is built on machine learning, but they are not the same thing.

What are embeddings in simple terms?

Embeddings are lists of numbers that represent the meaning of a word or piece of text, arranged so that similar meanings have similar numbers. They let a model treat “begin” and “start” as related while keeping unrelated words apart.

Why are transformers so important in NLP?

Transformers use an attention mechanism to weigh the relevance of all words in a sequence at once, capturing long-range context and training efficiently at scale. They are the foundation of nearly all current large language models.

Can I trust NLP output in research?

Only with verification. NLP models can be biased, can fabricate plausible-sounding content, and can change between versions. Record the model, version and settings, and check outputs against authoritative sources, as set out in our guidance for authors.
