  • Data papers: publishing datasets as citable outputs

    Some of the most valuable products of research are datasets: a long-running environmental monitoring series, a carefully curated genomic resource, a survey assembled over years. Such a dataset can underpin dozens of later studies and outlast the project that created it. Yet the people who built it have often struggled to get formal credit, because the traditional unit of academic recognition is the journal article that interprets data, not the data themselves. The data paper exists to close that gap: a peer-reviewed article whose subject is a dataset — describing what it contains, how it was produced and how to reuse it — turning data work into a citable, reviewable output in its own right. This article explains how data papers work and why they matter, drawing on the research outputs domain of the CASRAI Dictionary.

    What a data paper is — and is not

    A data paper is not a research paper that happens to share its data, and it is not a results paper in disguise. Its purpose is descriptive: to document a dataset thoroughly enough that others can find, understand, trust and reuse it. A typical data paper covers what the data are, how and why they were collected, the methods and instruments used, the structure and format of the data, quality-control and validation procedures, and — crucially — where the data are deposited and under what licence. What a data paper generally does not do is advance a new scientific hypothesis or interpret the data to reach a novel conclusion; the contribution is the well-described, reusable resource itself. This restraint is the point: it lets the value of the data be assessed on its own terms, separately from any particular analysis.

    Data journals and where data papers appear

    Data papers are published either in dedicated data journals or in conventional journals that accept the format. Two well-established examples illustrate the model. Scientific Data publishes peer-reviewed descriptions of datasets across the sciences, pairing each with structured metadata. Earth System Science Data publishes data papers in the Earth and environmental sciences, with a strong emphasis on data quality and reusability. Both venues apply full peer review: reviewers assess whether the data are sound, complete, properly documented and reusable in practice. That review is what gives a data paper its credibility. A peer-reviewed data paper is not merely a deposit; it is a vetted statement that the dataset meets a scholarly standard.

    The relationship between the paper and the data

    A central feature of the data paper model is the separation of the description from the data. The data paper is the human-readable, peer-reviewed article; the dataset itself lives in a repository, where it receives its own persistent identifier — typically a DataCite DOI — and is governed by an explicit licence. The data paper cites the dataset by that identifier, and the dataset record points back to the paper. This means there are two citable objects, linked but distinct: the dataset, which others cite when they reuse the data, and the data paper, which others cite when they draw on its description. Robust dataset citation through DataCite is what allows reuse of the data to be tracked and, over time, credited to the people who produced it. The infrastructure that makes datasets first-class citable objects is part of the wider picture covered in our data infrastructure domain.
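
    The two-way link between dataset and data paper described above can be sketched as a pair of metadata records. This is a minimal illustration under stated assumptions, not any repository's actual API: the DOIs are placeholders, the dictionary layout only loosely follows the DataCite `relatedIdentifiers` convention, the `IsDocumentedBy` / `Documents` relation pair is assumed for the dataset-to-paper link, and the helpers `has_relation` and `is_linked_both_ways` are hypothetical.

```python
# Hypothetical records; the DOIs are placeholders, not real identifiers.
dataset_record = {
    "doi": "10.5555/example-dataset",
    "resource_type": "Dataset",
    "license": "CC-BY-4.0",
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.5555/example-data-paper",
            "relatedIdentifierType": "DOI",
            "relationType": "IsDocumentedBy",  # dataset -> paper
        }
    ],
}

data_paper_record = {
    "doi": "10.5555/example-data-paper",
    "resource_type": "DataPaper",
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.5555/example-dataset",
            "relatedIdentifierType": "DOI",
            "relationType": "Documents",  # paper -> dataset
        }
    ],
}


def has_relation(record, target_doi, relation_type):
    """Return True if the record declares relation_type to target_doi."""
    return any(
        rel["relatedIdentifier"].lower() == target_doi.lower()
        and rel["relationType"] == relation_type
        for rel in record.get("relatedIdentifiers", [])
    )


def is_linked_both_ways(dataset, paper):
    """Check that the dataset points to the paper and the paper points back."""
    return has_relation(dataset, paper["doi"], "IsDocumentedBy") and has_relation(
        paper, dataset["doi"], "Documents"
    )


print(is_linked_both_ways(dataset_record, data_paper_record))
```

    Keeping the two records separate mirrors the model itself: someone reusing the data cites the dataset DOI, someone drawing on the description cites the paper DOI, and the reciprocal relations let either record lead a reader to the other.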

    Why data papers matter for credit and FAIR data

    The deeper reason data papers matter is incentives. For a long time, the rational move for a researcher who built a valuable dataset was to mine it for conventional papers, because that was what counted. The data paper changes the calculus by making the dataset itself a recognised, citable, peer-reviewed output that appears on a CV and accrues citations. That recognition rewards exactly the careful, time-consuming data stewardship that the research system otherwise undervalues. Data papers also advance the FAIR principles (that data should be Findable, Accessible, Interoperable and Reusable) almost by construction: a good data paper makes a dataset findable through publication and a DOI, states where it is deposited and under what licence it can be accessed, documents its structure and format to support interoperability, and exists precisely to enable reuse.

    Crediting the people behind the data

    Producing a high-quality dataset is collaborative work (collection, curation, validation, documentation), and a data paper is an opportunity to credit that work properly rather than burying it in an acknowledgement. The CRediT taxonomy maps naturally onto it, with the Data curation role recognising the management, annotation and maintenance of the data, alongside Investigation for collection and Methodology for how it was produced. The complete set of roles is described in our overview of the CRediT roles. Applying structured contribution statements to a data paper ensures that the curator who made the dataset reusable is named for that contribution, not left invisible behind the names of those who later analyse the data.
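
    The role assignments described above can be recorded as structured data rather than free text. The sketch below assumes a simple mapping from contributor to CRediT role names; the contributor labels and the helpers `unknown_roles` and `contributors_by_role` are hypothetical, and the code is an illustration rather than any journal's submission format.

```python
# The 14 role names of the CRediT taxonomy.
CREDIT_ROLES = {
    "Conceptualization",
    "Data curation",
    "Formal analysis",
    "Funding acquisition",
    "Investigation",
    "Methodology",
    "Project administration",
    "Resources",
    "Software",
    "Supervision",
    "Validation",
    "Visualization",
    "Writing – original draft",
    "Writing – review & editing",
}

# Hypothetical contributors to a data paper and the roles each one held.
contributions = {
    "Contributor A": ["Investigation", "Methodology"],
    "Contributor B": ["Data curation", "Validation"],
    "Contributor C": ["Data curation", "Writing – original draft"],
}


def unknown_roles(contribs):
    """Return, sorted, any role names that are not part of the taxonomy."""
    used = {role for roles in contribs.values() for role in roles}
    return sorted(used - CREDIT_ROLES)


def contributors_by_role(contribs):
    """Invert the mapping: role name -> list of contributors holding it."""
    by_role = {}
    for person, roles in contribs.items():
        for role in roles:
            by_role.setdefault(role, []).append(person)
    return by_role


print(unknown_roles(contributions))
print(contributors_by_role(contributions)["Data curation"])
```

    Inverting the mapping by role makes it easy to see at a glance who is credited with Data curation, which is the contribution most at risk of going unrecorded.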

    An output worth treating seriously

    Treating datasets as citable, reviewable outputs — with their own identifiers, their own peer review, and their own credit — recognises a simple reality: the data often outlast and out-influence any single paper drawn from them. Data papers give that reality formal standing. The consistent vocabulary that lets a dataset, a data paper and the contributions behind them be described the same way across repositories, journals and institutional systems is maintained in the CASRAI Dictionary, so that the credit a researcher earns for building a resource travels with it wherever it is reused.