What is tokenization?

Tokenization is the process of breaking text into smaller units called tokens (words, subwords, or characters) so that a computer can process language numerically. It is a foundational step in NLP and large language models.

What tokenization does

Computers cannot operate on raw text directly; they work with numbers. Tokenization bridges this gap by breaking a string of text into discrete units — tokens — which are then converted into numerical identifiers a model can process. How text is split matters: it determines the vocabulary the model sees and how it handles rare or unseen words. Tokenization is therefore one of the first and most consequential steps in any NLP pipeline, shaping everything that follows.
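
As a rough illustration of the text-to-numbers step, the sketch below splits a sentence on whitespace and maps each token to an integer identifier. The helper names build_vocab and encode, the "<unk>" placeholder, and the whitespace rule are assumptions made for this example, not the behaviour of any particular tokenizer.

```python
# Minimal sketch: split text into tokens, then map each token to an integer id.
# The whitespace split and the tiny vocabulary are illustrative assumptions.

def build_vocab(tokens):
    """Assign ids in order of first appearance; id 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab):
    """Lowercase, split on whitespace, and look up each token's id."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]


corpus = "the cat sat on the mat"
vocab = build_vocab(corpus.split())

# "rug" never appeared in the corpus, so it falls back to the unknown id 0.
ids = encode("the cat sat on the rug", vocab)
print(ids)  # [1, 2, 3, 4, 1, 0]
```

The final 0 shows the weakness of a naive word-level vocabulary: any word outside it collapses to the same unknown identifier.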

Types of tokens

Word tokenization splits text into words, typically at whitespace and punctuation. This is intuitive but produces very large vocabularies, and any word missing from the vocabulary can only be represented as a generic unknown token. Character tokenization uses individual characters, giving a tiny vocabulary but very long sequences.
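
The trade-off between the two schemes can be seen on a single sentence. This is a simplified sketch: the regular expression used for word splitting is an assumption chosen for the example, and real word tokenizers apply more elaborate rules.

```python
import re


def word_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenize(text):
    """Treat every character, including spaces, as its own token."""
    return list(text)


text = "Tokenizers split text."
words = word_tokenize(text)
chars = char_tokenize(text)

print(words)                   # ['Tokenizers', 'split', 'text', '.']
print(len(words), len(chars))  # 4 22
```

The same sentence is 4 word tokens but 22 character tokens, which is the sequence-length cost that character tokenization pays for its small vocabulary.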

Subword tokenization is the modern compromise used by most large language models: it breaks words into frequently occurring fragments using algorithms such as byte-pair encoding (BPE) or WordPiece. This keeps the vocabulary manageable while representing rare words as combinations of known subword pieces.
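
One way to picture how byte-pair encoding builds its fragments is the training loop sketched below: start from single characters and repeatedly merge the most frequent adjacent pair. The toy word frequencies and the function names are assumptions for illustration, and production implementations add details (byte-level input, end-of-word markers, pre-tokenization) that are omitted here.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, words):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Return the learned merge rules and the corpus segmented with them."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(best, words)
    return merges, words


# Hypothetical toy corpus: word -> frequency.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = learn_bpe(word_freqs, num_merges=3)
print(merges)
print(words)
```

On this toy corpus the frequent ending "est" is assembled into a single symbol within the first few merges, so "newest" and "widest" end up sharing a subword piece.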

Why tokenization matters for LLMs

In large language models, the token is the fundamental unit: the model predicts text one token at a time, and its context window and usage are measured in tokens. The choice of tokenizer affects efficiency, how well the model handles different languages, and how it breaks down rare words. Subword tokenization lets a fixed vocabulary cover an effectively unlimited range of text, which is why it underpins modern NLP. Quirks of tokenization can also explain surprising model behaviours, such as difficulty with character-level tasks like counting the letters in a word: the model sees subword tokens, not individual characters.
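
To make the idea of a fixed vocabulary covering unseen words concrete, the sketch below segments words by greedy longest-match-first lookup, the general strategy WordPiece uses at inference time. The vocabulary contents and the function name segment are hypothetical, and the "##" continuation markers used by real WordPiece vocabularies are left out for brevity.

```python
def segment(word, vocab, unk="<unk>"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is a known piece.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Not even the single character is known: give up on this word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary: a few subword pieces plus the lowercase ASCII letters.
vocab = {"token", "ization", "un", "believ", "able"} | set("abcdefghijklmnopqrstuvwxyz")

print(segment("tokenization", vocab))  # ['token', 'ization']
print(segment("unbelievable", vocab))  # ['un', 'believ', 'able']
print(segment("zyx", vocab))           # ['z', 'y', 'x']
print(segment("café", vocab))          # ['<unk>']
```

Any word spelled with known characters can be represented, falling back to single letters in the worst case. The last line shows the limit of this sketch: a character outside the base alphabet still fails, which is one reason some tokenizers start from bytes instead. The first line also hints at the character-level quirk: the model receives two pieces, not twelve letters.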

Tokenization in research

For reproducible NLP research, the exact tokenizer is part of the method, because different tokenization schemes change a model's inputs and can affect results. Token counts also influence comparisons of efficiency and cost across models. Researchers report the tokenizer used and consider its effect on languages and scripts beyond the one a model was primarily trained on, since tokenization that suits English may fragment other languages inefficiently and bias evaluation.
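
One contributing factor to uneven fragmentation can be checked directly. Byte-level tokenizers start from UTF-8 bytes, and scripts differ in how many bytes each character needs before any merges are learned. The sketch below only counts bytes; it is not a tokenizer, the sample words are arbitrary, and real token counts depend on the merges a given tokenizer has learned.

```python
def utf8_stats(text):
    """Return (characters, UTF-8 bytes, bytes per character) for a string."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return n_chars, n_bytes, n_bytes / n_chars


# Arbitrary sample greetings, chosen only for illustration.
samples = {"English": "hello", "Russian": "привет", "Chinese": "你好"}
stats = {lang: utf8_stats(s) for lang, s in samples.items()}

for lang, (n_chars, n_bytes, ratio) in stats.items():
    print(f"{lang}: {n_chars} chars, {n_bytes} bytes, {ratio:.1f} bytes/char")
# English: 5 chars, 5 bytes, 1.0 bytes/char
# Russian: 6 chars, 12 bytes, 2.0 bytes/char
# Chinese: 2 chars, 6 bytes, 3.0 bytes/char
```

Reporting a comparable measure, such as tokens per character or per word for each evaluation language, is one way to make a tokenizer's effect on cross-language comparisons visible.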

Key facts

At a glance

Definition: splitting text into tokens for processing
Token types: word, subword, character
Subword algorithms: byte-pair encoding (BPE), WordPiece
Purpose: convert text into units mapped to numbers
LLM relevance: models predict text token by token; context windows and usage are measured in tokens
Foundational step: one of the first stages of most NLP pipelines

Common questions

FAQ

What is a token in NLP?

A token is a small unit of text — a word, a subword fragment, or a character — produced by splitting a string during tokenization. Tokens are then mapped to numerical identifiers so that a model can process the text.

What is subword tokenization?

Subword tokenization splits words into frequently occurring fragments using algorithms such as byte-pair encoding or WordPiece. It keeps the vocabulary a manageable size while representing rare or unseen words as combinations of known subword pieces, and is standard in modern large language models.

Why does tokenization matter for large language models?

LLMs predict text one token at a time, and their context limits and usage are measured in tokens. The tokenizer affects efficiency, handling of different languages, and treatment of rare words, so it is an important and sometimes overlooked design choice.
