v2026.1714 entries · CC-BY 4.0
CASRAI

What is big data?

Big data refers to datasets so large, fast-moving, or varied that they exceed the capacity of traditional data-processing tools, requiring distributed storage and computation to capture, manage, and analyse them.

The "V" characteristics

Big data is often described by a set of properties beginning with the letter V. The original three, popularised by analyst Doug Laney in 2001, are volume (the sheer quantity of data), velocity (the speed at which it is generated and must be processed), and variety (the mix of structured, semi-structured, and unstructured formats). Two further Vs are widely added: veracity (uncertainty and quality of the data) and value (the usefulness that can be extracted). The label is relative: data is "big" when it strains the tools available to process it.

Why traditional tools fall short

A single conventional database on one machine has finite storage, memory, and throughput. When data exceeds those limits, or arrives faster than it can be processed, single-machine approaches break down.
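The limits described above can be written as a simple check. This is only an illustrative sketch: the `Machine` and `Workload` types, their field names, and every number in the usage example are hypothetical, chosen to mirror the two failure modes named in the paragraph (data larger than storage, or data arriving faster than it can be processed).

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical capacity of one conventional server."""
    storage_bytes: int
    throughput_bytes_per_s: float


@dataclass
class Workload:
    """Hypothetical dataset: how much data exists and how fast it arrives."""
    size_bytes: int
    arrival_bytes_per_s: float


def strains_single_machine(workload, machine):
    """Return the reasons (possibly none) why one machine cannot cope."""
    reasons = []
    if workload.size_bytes > machine.storage_bytes:
        reasons.append("volume exceeds storage")
    if workload.arrival_bytes_per_s > machine.throughput_bytes_per_s:
        reasons.append("arrival rate exceeds throughput")
    return reasons


TB = 10**12  # decimal terabyte, used only for the illustrative figures below

# Assumed figures, not measurements of any real system.
server = Machine(storage_bytes=2 * TB, throughput_bytes_per_s=200e6)
survey = Workload(size_bytes=50 * 10**9, arrival_bytes_per_s=1e6)
sensor_feed = Workload(size_bytes=40 * TB, arrival_bytes_per_s=500e6)

print(strains_single_machine(survey, server))
print(strains_single_machine(sensor_feed, server))
```

The same workload could pass the check against a larger machine, which is the sense in which "big" is relative to the tools at hand.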

Big-data technologies address this with distributed systems that spread storage and computation across many machines. Clustered storage splits a dataset into partitions held on different machines, and parallel-processing frameworks run the same computation on each partition and then combine the partial results, so analyses can run on data too large to fit on any one computer. This shift from single-machine to distributed processing is the practical core of the field.
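The partition, process, and combine pattern can be sketched on one computer. The example below is a minimal illustration, not the API of any particular big-data framework: the function names are hypothetical, and a thread pool stands in for a cluster of machines, each of which would hold one shard in a real deployment.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(records, n_shards):
    """Split records round-robin into n_shards lists (one per 'machine')."""
    return [records[i::n_shards] for i in range(n_shards)]


def count_words(shard):
    """Map step: count words within a single shard, independently of the rest."""
    counts = Counter()
    for line in shard:
        counts.update(line.lower().split())
    return counts


def merge(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def distributed_word_count(records, n_shards=4):
    """Count words by processing shards in parallel, then merging the results."""
    shards = partition(records, n_shards)
    # Threads are a stand-in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partials = list(pool.map(count_words, shards))
    return reduce(merge, partials, Counter())


records = ["big data big tools", "data tools", "big"]
print(distributed_word_count(records, n_shards=2))
```

The design point is that `count_words` never needs to see the whole dataset, and `merge` gives the same total regardless of how the records were split, which is what allows the work to be spread across machines.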

Big data in research

Many research domains now generate big data — genome sequencing, particle physics, astronomy, and large-scale sensor or social data. Handling it raises challenges of storage, transfer, and computation, but also of quality and provenance: large datasets can contain bias, errors, and gaps that are easy to overlook at scale. Sound research practice pairs scale with rigorous data management, so that big datasets remain FAIR — findable, accessible, interoperable, and reusable.
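One way to keep gaps and errors from being overlooked at scale is to profile a dataset before analysing it. The sketch below is an assumption-laden illustration: the `profile` function, the record layout, and the field names `id` and `temp` are all hypothetical, and it checks only two simple issues (missing required fields and exact duplicate records).

```python
def profile(records, required_fields):
    """Summarise missing required fields and exact duplicates in a list of dicts.

    Assumes every value in a record is hashable (numbers, strings, None).
    """
    missing = {field: 0 for field in required_fields}
    seen = set()
    duplicates = 0
    for record in records:
        for field in required_fields:
            # A field counts as missing if absent or explicitly None.
            if record.get(field) is None:
                missing[field] += 1
        key = tuple(sorted(record.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"total": len(records), "missing": missing, "duplicates": duplicates}


# Hypothetical sensor readings, invented for this illustration.
readings = [
    {"id": 1, "temp": 20.5},
    {"id": 2, "temp": None},
    {"id": 1, "temp": 20.5},
    {"id": 3},
]
print(profile(readings, ["id", "temp"]))
```

A summary like this does not detect bias, which usually requires knowing how the data was collected, but it makes gaps and duplication visible before they propagate into results.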

Relationship to data science

Big data and data science are related but distinct: big data names the data and infrastructure challenge, while data science is the discipline of extracting knowledge from data of any size. Many data-science methods, including machine learning, benefit from large datasets, but rigorous data analysis remains essential — more data does not automatically mean more reliable conclusions.
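The last point can be illustrated with a small simulation. This is a sketch under stated assumptions: the true value, the noise level, and the size of the systematic bias are all invented for the example. It compares a small sample measured without bias against a much larger sample in which every measurement is shifted by the same amount.

```python
import random
from statistics import fmean

TRUE_MEAN = 0.0  # hypothetical quantity being estimated


def sample_mean(n, bias, seed):
    """Mean of n noisy measurements, each shifted by a constant bias."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    values = [rng.gauss(TRUE_MEAN + bias, 1.0) for _ in range(n)]
    return fmean(values)


small_clean = sample_mean(n=100, bias=0.0, seed=1)
large_biased = sample_mean(n=100_000, bias=0.5, seed=2)

print(f"error, 100 unbiased measurements:   {abs(small_clean - TRUE_MEAN):.3f}")
print(f"error, 100,000 biased measurements: {abs(large_biased - TRUE_MEAN):.3f}")
```

Collecting more measurements shrinks the random noise in the estimate but leaves a systematic bias untouched, so the larger dataset settles confidently on the wrong value.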

Key facts

At a glance

  • Definition: data too large/fast/varied for traditional tools
  • Original three Vs: volume, velocity, variety (Laney, 2001)
  • Often-added Vs: veracity, value
  • Core technology: distributed storage and parallel computation
  • Relative concept: "big" depends on available tools
  • Research relevance: genomics, physics, astronomy, sensor data

Common questions

FAQ

What are the V's of big data?

The original three are volume, velocity, and variety, popularised in 2001. Veracity (data quality and uncertainty) and value (usefulness extracted) are commonly added, giving the "five Vs" often quoted today.

How is big data different from ordinary data?

Big data is data whose volume, velocity, or variety exceeds what traditional single-machine tools can handle, requiring distributed storage and parallel processing. The threshold is relative to the tools available.

Is big data the same as data science?

No. Big data refers to the data and infrastructure challenge of very large or fast datasets. Data science is the discipline of extracting knowledge from data of any size, which may or may not be "big".
