Text And Data Mining: Definition, Meaning & Examples | CASRAI
Text and data mining (TDM) is the automated, computational analysis of large bodies of text and data to discover patterns, trends, and relationships that would be impractical to find by reading alone. In research it powers literature-scale analysis, knowledge extraction, and increasingly the assembly of datasets for training artificial-intelligence models.
What TDM is and what it is used for
Text and data mining refers to automated techniques for analysing text and data in digital form to generate information such as patterns, trends, and correlations. Researchers use it to survey thousands of papers at once, extract named entities and relationships, build knowledge graphs, track concepts across the literature, and prepare large datasets for downstream computation. The defining feature is scale: TDM addresses questions that depend on processing far more material than a person could read.
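As a minimal sketch of what tracking concepts across the literature can look like in practice, the following Python counts how many documents mention each term on a watch-list and how often two terms appear in the same document. The three-sentence corpus, the term list, and the function names are invented for illustration; a real pipeline would run over thousands of full texts and would normally use proper entity recognition rather than simple word matching.

```python
from __future__ import annotations

import re
from collections import Counter
from itertools import combinations

# Toy corpus standing in for thousands of abstracts (invented for illustration).
CORPUS = [
    "Aspirin reduces inflammation and lowers fever.",
    "Ibuprofen reduces inflammation in arthritis patients.",
    "Fever and inflammation often appear together in infection.",
]

# Hypothetical watch-list of concepts to track across the corpus.
TERMS = {"aspirin", "ibuprofen", "inflammation", "fever"}


def tokens(text: str) -> set[str]:
    """Lowercase a document and return its distinct word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def mine(corpus: list[str], terms: set[str]) -> tuple[Counter, Counter]:
    """Count the documents mentioning each term and each pair of terms."""
    doc_freq: Counter = Counter()
    pair_freq: Counter = Counter()
    for doc in corpus:
        found = sorted(tokens(doc) & terms)  # sorted so pairs have a stable order
        doc_freq.update(found)
        pair_freq.update(combinations(found, 2))
    return doc_freq, pair_freq


doc_freq, pair_freq = mine(CORPUS, TERMS)
print(doc_freq.most_common(1))   # [('inflammation', 3)]
print(pair_freq.most_common(1))  # [(('fever', 'inflammation'), 2)]
```

Even this toy version has to hold a copy of every document in memory to process it, which is the kind of copying that brings copyright into play.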
The copyright challenge
A practical obstacle is that mining typically involves making copies of works (downloading, reformatting, and storing them for processing), which can implicate copyright even when no part of a work is republished. This is why some jurisdictions, including the EU and the UK, provide specific TDM exceptions rather than leaving the question to general fair-dealing or fair-use reasoning alone. Whether a given act of mining is lawful depends on the jurisdiction, the purpose (research versus commercial), and the terms under which the content was accessed.
EU and UK exceptions
The EU’s Directive on Copyright in the Digital Single Market (the DSM Directive, Directive (EU) 2019/790) provides two TDM exceptions. Article 3 covers reproductions and extractions made by research organisations and cultural-heritage institutions for scientific research, and cannot be overridden by contract. Article 4 provides a broader exception for any purpose, including commercial use, but allows rights holders to opt out by reserving their rights, for example in machine-readable form. The UK has its own statutory exception (section 29A of the Copyright, Designs and Patents Act 1988), which permits copies for TDM but has historically been limited to non-commercial research; whether to widen it remains a continuing policy question.
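One way a machine-readable reservation can be expressed is through an HTTP response header. The sketch below assumes the `tdm-reservation` header described in the TDM Reservation Protocol (TDMRep) community draft, where `1` signals that rights are reserved and `0` that they are not; the header name and values should be checked against the current version of that specification, and the function name is hypothetical. It works on a plain dictionary of headers, so it makes no network request.

```python
from __future__ import annotations

from typing import Mapping, Optional


def tdm_rights_reserved(headers: Mapping[str, str]) -> Optional[bool]:
    """Interpret a TDMRep-style reservation header (assumed name and values).

    Returns True if rights are reserved, False if the header says they are
    not, and None if the header is absent or has an unrecognised value.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    value = normalised.get("tdm-reservation")
    if value == "1":
        return True
    if value == "0":
        return False
    return None


# Example with hand-written headers (no request is sent).
print(tdm_rights_reserved({"TDM-Reservation": "1"}))  # True
print(tdm_rights_reserved({"Content-Type": "text/html"}))  # None
```

A `None` result only means that this particular signal is missing. A rights holder may have reserved rights by other means, such as website terms, so the absence of the header is not evidence that mining is permitted.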
Licensing and AI training data
Where an exception does not apply, for instance commercial mining in a jurisdiction without a broad exception, or content whose rights holders have opted out, TDM relies on licences from publishers or data providers, which may set technical and usage conditions. The use of mined text and data to train AI models has sharpened these questions, raising debate over whether existing exceptions cover model training, how opt-outs should work, and what transparency and remuneration are owed to rights holders. This remains a fast-moving legal and policy area.
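The routing between the two EU exceptions and a licence, as this page summarises it, can be sketched as a small decision function. This is a deliberate simplification under stated assumptions: the class, field, and function names are hypothetical, and the sketch ignores other conditions that matter in practice, such as the terms under which the content was accessed. It illustrates the order of the questions rather than stating the law.

```python
from dataclasses import dataclass
from enum import Enum


class Basis(Enum):
    ART_3_RESEARCH = "DSM Directive Art 3 (research exception)"
    ART_4_GENERAL = "DSM Directive Art 4 (general exception, no opt-out found)"
    LICENCE_NEEDED = "No exception in this sketch applies; a licence is needed"


@dataclass(frozen=True)
class MiningRequest:
    """Hypothetical description of a planned mining activity in the EU."""
    research_or_heritage_org: bool  # research organisation or cultural-heritage institution
    scientific_research: bool       # the purpose is scientific research
    rights_reserved: bool           # the rights holder has opted out of Art 4


def eu_basis(request: MiningRequest) -> Basis:
    """Pick the basis for mining, following the simplified routing above."""
    # Art 3 cannot be overridden by contract, so it is checked first.
    if request.research_or_heritage_org and request.scientific_research:
        return Basis.ART_3_RESEARCH
    # Art 4 covers any purpose, but only where rights have not been reserved.
    if not request.rights_reserved:
        return Basis.ART_4_GENERAL
    return Basis.LICENCE_NEEDED


# A commercial miner facing opted-out content falls through to licensing.
commercial = MiningRequest(False, False, rights_reserved=True)
print(eu_basis(commercial).name)  # LICENCE_NEEDED
```

Checking Article 3 first reflects the point that it does not depend on whether rights have been reserved, whereas Article 4 does.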
Key facts
At a glance
- Definition: Automated computational analysis of large text and data corpora
- Uses: Literature-scale analysis, knowledge extraction, AI datasets
- Legal hook: Mining involves copying, so it engages copyright law
- EU law: DSM Directive Arts 3 (research) and 4 (general, opt-out)
- UK law: A statutory TDM exception, historically non-commercial research
- Debate: Licensing and the use of mined content as AI training data
Common misconceptions
What people often get wrong
Often heard: TDM is always legal because no work is republished.
Actually: No — mining usually requires copying works, which can engage copyright. Lawfulness depends on a specific exception or a licence, the jurisdiction, and whether the use is research or commercial.
Often heard: The EU and UK TDM exceptions are the same.
Actually: No — they differ in scope. The EU DSM Directive has separate research (Art 3) and general opt-out (Art 4) exceptions; the UK exception has historically been limited to non-commercial research.
Often heard: TDM and AI model training are clearly covered by existing law.
Actually: No — whether TDM exceptions extend to training AI models, and how rights-holder opt-outs apply, is contested and actively evolving in both law and policy.