Big data refers to datasets so large, fast-moving or varied that traditional database tools cannot capture, store or analyse them within a reasonable time. It is defined less by an exact size threshold than by a set of characteristics, usually summarised as the “Vs”, and by the distributed computing methods needed to process it. In research, big data spans genomics, sensor networks, clinical records, social media and large-scale simulations.
The defining Vs of big data
The concept began with three Vs and has since expanded. The table below sets out the five most widely cited.
| Characteristic | Meaning | Research example |
|---|---|---|
| Volume | The sheer quantity of data, from terabytes to petabytes and beyond | Whole-genome sequencing across cohorts |
| Velocity | The speed at which data is generated and must be processed | Real-time readings from environmental sensors |
| Variety | The mix of formats: structured, semi-structured and unstructured | Combining tables, images, text and audio |
| Veracity | The trustworthiness, accuracy and completeness of the data | Cleaning noisy or missing clinical records |
| Value | The usefulness of insights that can be extracted | Identifying disease risk factors at scale |
Volume, velocity and variety were the original three, capturing the scale, speed and heterogeneity that overwhelm conventional tools. Veracity was added to stress that more data is not automatically better data; noise, bias and gaps must be managed. Value reminds us that the point of all this effort is actionable insight, not collection for its own sake.
Distributed processing: how big data is handled
No single machine can hold or analyse a petabyte efficiently, so big data relies on distributed processing: spreading storage and computation across clusters of many machines that work in parallel. The foundational pattern was MapReduce, which splits a task into pieces, processes each piece on a separate node (the map step), then combines the partial results (the reduce step). Frameworks such as Apache Hadoop and, later, Apache Spark made this approach mainstream, with Spark adding in-memory processing that is often far faster. Cloud platforms now offer such clusters on demand, letting researchers scale resources to the dataset rather than the other way round.
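The split, process and combine pattern can be illustrated on a single machine. The following is a minimal sketch, not how Hadoop or Spark are implemented: it assumes a word-count task, uses a thread pool as a stand-in for cluster nodes, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative choices rather than any framework's API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key, as a framework would between the two steps."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently; the pool stands in for worker nodes.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Three chunks play the role of data blocks stored on different machines.
chunks = ["big data big clusters", "data moves fast", "big data"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped without reference to any other, the map step can run on as many workers as there are chunks; only the grouping and reduce steps need to bring results together.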
Big data in research, and its pitfalls
Used well, big data lets researchers detect patterns invisible at small scale, model complex systems and test hypotheses across enormous samples. But scale brings risks. Large datasets can be biased or unrepresentative despite their size, and the volume can lull analysts into ignoring how the data was collected. Crucially, big data does not suspend statistical thinking: with millions of observations, even a trivially small difference can become statistically significant, which is why effect size matters more than ever and why a small p-value on its own means little. Big data also fuels machine learning, where larger samples help guard against the overfitting that plagues models trained on too little data.
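The point about significance and effect size can be made concrete with a short calculation. This sketch rests on simplifying assumptions that are not from the text above: two groups of equal size, a known standard deviation of 1 in each, a fixed true difference of 0.02 standard deviations, and a two-sided z-test. The function name `two_sided_p` is illustrative.

```python
import math

# Assumed standardised effect size: a difference of 0.02 standard deviations.
EFFECT_SIZE = 0.02


def two_sided_p(effect_size, n_per_group):
    """Two-sided p-value of a two-sample z-test with unit variance in each group."""
    # The standard error of the difference in means is sqrt(2 / n).
    z = effect_size / math.sqrt(2 / n_per_group)
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


# The effect stays the same; only the sample size grows.
p_values = {n: two_sided_p(EFFECT_SIZE, n) for n in (100, 10_000, 1_000_000)}

for n, p in p_values.items():
    print(f"n per group = {n:>9,}  effect size = {EFFECT_SIZE}  p = {p:.3g}")
```

Under these assumptions the same negligible effect is nowhere near significant with 100 observations per group and overwhelmingly significant with a million, while the effect size itself never changes.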
Big data and FAIR principles
The promise of big data depends on the data being usable, and that is where the FAIR principles (that data should be Findable, Accessible, Interoperable and Reusable) become essential. Findability requires rich metadata and persistent identifiers. Accessibility requires that data and metadata can be retrieved through standard protocols under clearly stated access conditions. Interoperability requires shared vocabularies, the kind standardised in the CASRAI dictionary, so that varied sources can be combined meaningfully. Reusability requires clear provenance and licensing. Without these foundations, a large dataset is merely a large liability. Our broader work on standards and metadata, including our guidance for authors and our reproducibility coverage, sets out how to make big research data dependable rather than just big.
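One way to make these requirements checkable is to test each dataset record for the metadata it needs. The sketch below is illustrative only: the field names, their grouping by principle and the sample record are all hypothetical, and it is not an official FAIR checklist or any repository's schema.

```python
# Hypothetical minimum fields, loosely grouped by FAIR principle.
REQUIRED_FIELDS = {
    "findable": ["identifier", "title", "description"],
    "accessible": ["access_url"],
    "interoperable": ["vocabulary"],
    "reusable": ["license", "provenance"],
}


def missing_fields(record):
    """Return the absent or empty fields of a record, keyed by principle."""
    gaps = {}
    for principle, fields in REQUIRED_FIELDS.items():
        absent = [name for name in fields if not record.get(name)]
        if absent:
            gaps[principle] = absent
    return gaps


# Illustrative record; the identifier and URL are placeholders, not real ones.
record = {
    "identifier": "doi:10.0000/example",
    "title": "Example sensor readings",
    "description": "Hourly temperature readings (illustrative record).",
    "access_url": "https://example.org/data/sensors",
    "vocabulary": "",  # present but empty, so it counts as missing
    "license": "CC-BY-4.0",
}

gaps = missing_fields(record)
print(gaps)
```

A check like this only confirms that fields are filled in; whether an identifier resolves or a licence actually permits reuse still needs human or service-level verification.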
Frequently asked questions
How big does data have to be to count as big data?
There is no fixed size. Big data is defined by characteristics, the Vs, rather than a threshold. The practical test is whether traditional tools struggle to store or process it within a useful timeframe.
What are the original three Vs?
Volume, velocity and variety: the scale of the data, the speed at which it arrives, and the diversity of its formats. Veracity and value were added later to address quality and usefulness.
Why is veracity important?
Because size does not guarantee quality. Large datasets can contain errors, bias, duplicates and missing values. Veracity emphasises assessing and improving trustworthiness before drawing conclusions.
How does big data relate to FAIR data?
FAIR principles make big data usable by ensuring it is Findable, Accessible, Interoperable and Reusable. Persistent identifiers and shared vocabularies, such as those standardised in the CASRAI dictionary, let varied large datasets be combined and reused reliably.







