What is natural language processing?

Natural language processing is the field at the intersection of computer science, artificial intelligence, and linguistics concerned with enabling computers to read, interpret, generate, and respond to human language.

What NLP covers

NLP bridges the gap between human language, which is ambiguous, context-dependent, and varied, and the precise representations that computers require. It spans low-level steps such as tokenisation, part-of-speech tagging, and syntactic parsing, and higher-level tasks such as named-entity recognition, sentiment analysis, machine translation, summarisation, and question answering. Because the same words can mean different things in different contexts (for example, "bank" can denote a financial institution or the side of a river), handling ambiguity and context is a recurring challenge across all of these tasks.
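
As a rough illustration of the lowest-level step named above, the sketch below tokenises a sentence with a regular expression. The pattern, the function name `tokenise`, and the example sentence are assumptions made for this illustration; production tokenisers handle many more cases (abbreviations, URLs, languages without spaces).

```python
import re

# Hypothetical pattern for illustration only: a word (optionally with internal
# apostrophes or hyphens, as in "don't" or "part-of-speech"), or a single
# punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenise(text: str) -> list[str]:
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenise("He fished from the bank, then paid the cheque into the bank.")
print(tokens)
# ['He', 'fished', 'from', 'the', 'bank', ',', 'then', 'paid', 'the',
#  'cheque', 'into', 'the', 'bank', '.']

# Both senses of "bank" come out as the identical token, so tokenisation
# alone cannot tell them apart; that needs the surrounding context.
print(tokens.count("bank"))  # 2
```

The two occurrences of "bank" are indistinguishable at this level, which is one small instance of the ambiguity problem that later processing steps have to resolve.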

From rules to neural models

Early NLP relied on hand-written grammatical rules and lexicons. From the 1990s, statistical methods that learned patterns from annotated corpora made systems more robust.

The current era is defined by neural networks and especially the transformer architecture, introduced in 2017. Transformer-based models learn rich representations of language from vast unlabelled text and can be adapted to many tasks, which is why a single pre-trained model now underlies translation, summarisation, and dialogue systems alike.

Common NLP tasks

Typical tasks include machine translation (rendering text in another language), information extraction (pulling structured facts from text), sentiment analysis (classifying opinion), summarisation, and question answering. Each is evaluated with task-specific metrics, such as BLEU for translation or F1 for extraction, computed against human-annotated reference data. Speech-based tasks add automatic speech recognition and text-to-speech.
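
To make the F1 metric concrete, here is a minimal sketch that scores a set of extracted entities against a human-annotated reference set using exact matching. The function name `precision_recall_f1` and the entity data are invented for this example; real evaluations often use more lenient or span-based matching rules.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 of predicted items against gold items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up (entity, type) pairs, purely for illustration.
gold = {
    ("Marie Curie", "PERSON"),
    ("Paris", "LOCATION"),
    ("Sorbonne", "ORGANISATION"),
}
predicted = {
    ("Marie Curie", "PERSON"),
    ("Paris", "ORGANISATION"),   # wrong type, so it does not match
    ("Sorbonne", "ORGANISATION"),
    ("1903", "DATE"),            # not in the reference annotations
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.500 recall=0.667 f1=0.571
```

Two of the four predictions are correct (precision 0.5) and two of the three reference entities are found (recall about 0.667), giving an F1 of 4/7.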

NLP in research

NLP methods help researchers mine large text collections, such as scientific literature, clinical notes, and historical archives, for patterns at scale. Methodologically, NLP research depends on shared benchmarks, careful annotation guidelines, and reporting of inter-annotator agreement. Known concerns include bias inherited from training corpora, sensitivity to domain shift, and the difficulty of evaluating open-ended generation. Findings should be validated against held-out, human-labelled data.
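
Inter-annotator agreement is often reported with a chance-corrected statistic. The sketch below computes Cohen's kappa for two annotators; this is one common choice and not the only one, and the function name `cohen_kappa` and the labels are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random with their own
    # label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up sentiment labels from two hypothetical annotators.
annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
annotator_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

print(cohen_kappa(annotator_a, annotator_b))  # 0.5
```

Here the annotators agree on 6 of 8 items (0.75), but with balanced labels 0.5 agreement is expected by chance, so kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.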

Key facts

At a glance

Field: intersection of computer science, AI, and linguistics
Goal: computers that process and generate human language
Low-level tasks: tokenisation, tagging, parsing
High-level tasks: translation, summarisation, question answering
Modern basis: transformer architecture (2017)
Key challenge: ambiguity and context in language

Common questions

FAQ

What is natural language processing used for?

NLP powers machine translation, search, sentiment analysis, information extraction, summarisation, question answering, and conversational systems. In research it is used to analyse large text collections such as scientific literature or archives.

How is modern NLP different from older approaches?

Early NLP used hand-written rules; later systems used statistics learned from annotated corpora. Modern NLP uses neural, transformer-based models trained on very large text datasets, which generalise across many tasks.

What is the main difficulty in NLP?

Human language is highly ambiguous and context-dependent: the same word or sentence can mean different things in different situations. Resolving this ambiguity reliably is the central challenge across NLP tasks.
