Examples
Worked examples
- Is an instance
An institutional research-IT data lake holding raw genomics FASTQ files, microscopy images, and instrument logs.
- Is an instance
An astronomical observatory's S3-based data lake ingesting raw telescope outputs prior to pipeline processing.
Counter-examples
Looks similar, but isn't
- Not an instance
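A small structured database is not a data lake: it enforces a schema on write and holds one kind of data rather than large volumes of heterogeneous raw files.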
- Not an instance
A curated, schema-on-write data warehouse is the contrasting pattern, not a data lake.
Editorial commentary
The data-lake pattern emerged in the early 2010s (notably promoted by James Dixon at Pentaho) as a counter-position to traditional data warehousing. Modern data lakes are typically built on object storage (S3, Azure Blob, GCS), with optional table layers (Apache Iceberg, Delta Lake, Hudi) and compute engines (Spark, Trino). In research-information contexts, data lakes are used for raw observational data (telemetry, instrument exhaust, log files) and for aggregating large heterogeneous corpora before downstream curation.
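The schema-on-read idea behind this pattern can be sketched in a few lines. The following is a minimal illustration only, assuming a local directory stands in for object storage and newline-delimited JSON stands in for instrument logs; the function names (`ingest_raw`, `read_with_schema`, `demo`) and the sample fields are hypothetical and not part of any data-lake product.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def ingest_raw(lake_dir: Path, name: str, payload: bytes) -> Path:
    """Store bytes unchanged; nothing is validated at write time."""
    target = lake_dir / "raw" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_with_schema(path: Path, schema: dict) -> list[dict]:
    """Apply a schema at read time: keep declared fields, coerce types,
    and skip lines that do not parse or do not fit the schema."""
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # raw data may contain lines that are not JSON
        if not isinstance(record, dict):
            continue
        try:
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
        except (KeyError, TypeError, ValueError):
            continue  # record lacks a field or a value cannot be coerced
    return rows


def demo() -> list[dict]:
    # Hypothetical instrument log: one good line, one junk line, one incomplete line.
    raw = (
        b'{"instrument": "scope-1", "temp_c": "21.5", "note": "ok"}\n'
        b"not json at all\n"
        b'{"instrument": "scope-2"}\n'
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = ingest_raw(Path(tmp), "instrument.log", raw)
        return read_with_schema(path, {"instrument": str, "temp_c": float})


if __name__ == "__main__":
    print(demo())
```

Ingestion accepts all three lines without complaint; only the read step, given a schema, decides which records are usable. A schema-on-write system would instead reject the malformed lines at ingestion.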
References
- Dixon J., ‘Pentaho, Hadoop, and Data Lakes’ (Pentaho blog, 2010).
- Russom P., ‘Data Lakes: Purposes, Practices, Patterns, and Platforms’ (TDWI Best Practices Report, 2017).
Machine-readable encodings
Use in your systems
<role vocab="credit"
vocab-identifier="https://casrai.org/dictionary/"
vocab-term="Data lake"
vocab-term-identifier="https://casrai.org/dictionary/term/data-lake" />

{
"@context": "https://schema.org",
"@type": "DefinedTerm",
"name": "Data lake",
"identifier": "https://casrai.org/dictionary/term/data-lake",
"description": "A storage repository that holds large volumes of structured, semi-structured, and unstructured data in their native formats, deferring schema-on-write requirements so that data can be ingested cheaply and only structured at the time of read or analysis.",
"inDefinedTermSet": "https://casrai.org/dictionary/domain/research-data-infrastructure/",
"url": "https://casrai.org/dictionary/term/data-lake",
"sameAs": [],
"license": "https://creativecommons.org/licenses/by/4.0/"
}
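
As one possible way to consume the JSON-LD record above, the sketch below parses it with Python's standard `json` module and extracts a few fields. The embedded record is an abbreviated copy (the description and several other fields are omitted for brevity), and the helper name `load_defined_term` and the output keys are assumptions made for illustration.

```python
import json

# Abbreviated copy of the record above; omitted fields are left out for brevity.
RECORD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Data lake",
  "identifier": "https://casrai.org/dictionary/term/data-lake",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/research-data-infrastructure/"
}
"""


def load_defined_term(text: str) -> dict:
    """Parse a schema.org DefinedTerm record and return the fields a
    local vocabulary table might need (hypothetical key names)."""
    data = json.loads(text)
    if data.get("@type") != "DefinedTerm":
        raise ValueError("expected a schema.org DefinedTerm record")
    return {
        "label": data["name"],
        "uri": data["identifier"],
        "term_set": data.get("inDefinedTermSet"),
    }


term = load_defined_term(RECORD)
print(term["label"], term["uri"])
```

This treats the record as plain JSON; a system that needs full JSON-LD semantics (context expansion, graph merging) would use a dedicated JSON-LD processor instead.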







