Tag: data steward

  • Trusted repositories and the EOSC: where research data should live

    Open and FAIR data has to live somewhere, and the choice of where is not a clerical detail. A dataset deposited on a personal web page, a lab server, or a service that may not exist in five years is, for the purposes of long-term reuse, lost. The question of where research data should live is the question of trusted repositories, and the European answer to coordinating them is the EOSC. This article maps the landscape, drawing on the data-infrastructure domain of the CASRAI Dictionary.

    What makes a repository trustworthy

    Not every place that can store a file is fit to be the home of the scholarly record. A trusted digital repository is one assessed against a recognised trust framework, demonstrating that it has the organisational and technical capability to preserve and provide access to data over the long term. Trust here is not a vibe; it is a set of demonstrable properties — a sustainability plan, preservation procedures, persistent identifiers, clear access conditions, and the organisational continuity to outlast any individual project or grant.

    The most widely recognised certification of these properties is CoreTrustSeal, a community-governed assessment that a repository meets the core requirements of trustworthy data stewardship. A CoreTrustSeal certification is a concrete signal a funder or researcher can rely on: it means an independent process has checked that the repository can actually do what “long-term preservation” implies. When a funder mandate says data must go to a trusted repository, CoreTrustSeal is the most common way the word “trusted” is given operational meaning.

    The repository taxonomy: generalist and domain

    Trusted repositories come in two broad kinds, and choosing well between them is one of the most consequential data-management decisions a researcher makes.

    • A generalist repository accepts data from any discipline. Zenodo, Figshare, and Dryad are the familiar examples: they mint a DOI, accept almost any data type, and provide a reliable, citable home when no specialist option exists. They are the right default for the long tail of research data that has no natural disciplinary home.
    • A domain repository is discipline-specific, built around the data types, standards, and community of a particular field. GenBank for nucleotide sequence data is the archetype; there are equivalents across crystallography, astronomy, social science, proteomics, and more. A domain repository adds what a generalist cannot: discipline-specific metadata standards, validation, and a community of expert users who will actually find and reuse the data.

    The practical rule that funders increasingly articulate is: deposit in the appropriate domain repository where one exists, and fall back to a trusted generalist repository where it does not. A sequence belongs in GenBank, not in a generic store; a one-off dataset with no community home belongs in a generalist repository with a DOI rather than on a server that will be decommissioned.
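
    The rule above can be sketched as a small selection function. This is a minimal illustration, not any funder's actual procedure: the Repository record, the choose_repository function, the data-type labels, and the registry entries are all hypothetical, and the trusted flags are placeholders rather than statements about any repository's certification status.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Repository:
    name: str
    kind: str                              # "domain" or "generalist"
    data_types: frozenset = frozenset()    # data types a domain repository accepts
    trusted: bool = False                  # placeholder for "meets the trust framework"


def choose_repository(data_type: str, candidates: list[Repository]) -> Optional[Repository]:
    """Prefer a trusted domain repository; otherwise a trusted generalist one."""
    trusted = [r for r in candidates if r.trusted]
    for repo in trusted:
        if repo.kind == "domain" and data_type in repo.data_types:
            return repo
    for repo in trusted:
        if repo.kind == "generalist":
            return repo
    return None  # no trusted home found; never fall back to an untrusted store


# Hypothetical registry for illustration; the trusted flags are assumptions.
REGISTRY = [
    Repository("GenBank", "domain", frozenset({"nucleotide-sequence"}), trusted=True),
    Repository("Zenodo", "generalist", trusted=True),
    Repository("lab-server", "generalist", trusted=False),
]

sequence_home = choose_repository("nucleotide-sequence", REGISTRY)
one_off_home = choose_repository("survey-responses", REGISTRY)
```

    The design choice worth noting is the final return: when nothing trusted is available, the function reports that fact instead of quietly selecting the lab server.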

    The EOSC: coordinating the federation

    Individual trusted repositories are necessary but not sufficient. A researcher also needs to find the right one, move data and compute between services, and trust that the pieces interoperate. In Europe, the coordinating layer for this is the European Open Science Cloud (EOSC) — a federation of research-data services rather than a single monolithic platform.

    The EOSC’s model is federation: an EOSC node is a service provider connected to the federation, and an EOSC service is something offered through its catalogue — a repository, a compute resource, a data-management tool. The aspiration is that a researcher can discover trusted repositories, deposit data, and compose data with compute across institutional and national boundaries, through a coordinated catalogue rather than a patchwork of disconnected services. The EOSC is, in effect, the European attempt to make “where should this data live?” answerable through one front door onto many trustworthy providers. It is not the only such effort — the African Open Science Platform pursues a comparable continental federation — but it is the most developed.

    The human layer: stewards and custodians

    Infrastructure does not curate itself, and an honest account of where data should live has to name the people. A data steward is the professional responsible for data quality, governance, and ongoing curation: the role that makes the difference between data that is merely deposited and data that is genuinely reusable. A data custodian holds legal or operational responsibility for the data. Around them sit the structured agreements that govern sharing: a data sharing agreement setting the conditions under which data moves between parties, an embargo period deferring public access after deposit, and access controls distinguishing open, restricted, and metadata-only data.

    A trusted repository with no data steward behind the data is a safe building with empty rooms. Preservation is an organisational commitment carried out by people, not a property that storage acquires on its own.
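
    The access conditions named in this section (an embargo period, and open, restricted, or metadata-only access) might be modelled roughly as follows. This is a sketch under stated assumptions: the Deposit record, the can_download function, and the has_agreement flag are hypothetical names, and treating an embargo as blocking every download until it ends is a simplification that a real repository policy may not share.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class AccessLevel(Enum):
    OPEN = "open"
    RESTRICTED = "restricted"
    METADATA_ONLY = "metadata-only"


@dataclass
class Deposit:
    identifier: str                        # placeholder identifier, not a real PID
    access: AccessLevel
    embargo_until: Optional[date] = None   # access to the files is deferred until this date


def can_download(deposit: Deposit, today: date, has_agreement: bool = False) -> bool:
    """Return True if the data files (not just the metadata) may be retrieved."""
    # Assumption: an active embargo blocks all downloads, whatever the access level.
    if deposit.embargo_until is not None and today < deposit.embargo_until:
        return False
    if deposit.access is AccessLevel.OPEN:
        return True
    if deposit.access is AccessLevel.RESTRICTED:
        # Restricted data is released only under a data sharing agreement.
        return has_agreement
    return False  # metadata-only: the description is visible, the files are not


embargoed = Deposit("example-1", AccessLevel.OPEN, embargo_until=date(2030, 1, 1))
restricted = Deposit("example-2", AccessLevel.RESTRICTED)
```

    Even in a sketch this small, someone has to decide the access level, set the embargo date, and approve each agreement. Those are stewardship decisions, not storage features.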

    Why this connects to FAIR and to identifiers

    Where data lives is what makes the FAIR principles operational. Findability depends on the repository minting a persistent identifier and exposing good metadata; accessibility depends on stable resolution and clear access conditions; interoperability and reusability depend on the standards a domain repository enforces. A trusted repository is, in practice, the machine that turns the FAIR aspiration into a deposited reality — which is why the choice of repository, and the trust signal of CoreTrustSeal, matters as much as the decision to share at all. The repository is also where the data’s persistent identifier enters the broader graph that links it to the project, the people, and the funding.
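
    One way to picture the mapping from repository properties to FAIR principles is a simple gap check over a deposit record. This is an illustrative sketch only: the field names, the FAIR_REQUIREMENTS table, and the fair_gaps function are assumptions made for this example, not a recognised FAIR assessment method, and the identifier shown is a placeholder.

```python
# Hypothetical field names: which record fields support which FAIR principle,
# following the mapping described in the paragraph above.
FAIR_REQUIREMENTS = {
    "findable": ("persistent_identifier", "metadata"),
    "accessible": ("resolution_url", "access_conditions"),
    "interoperable_and_reusable": ("community_standard",),
}


def fair_gaps(record: dict) -> dict:
    """List, per principle, the required fields that are missing or empty."""
    gaps = {}
    for principle, fields in FAIR_REQUIREMENTS.items():
        missing = [field for field in fields if not record.get(field)]
        if missing:
            gaps[principle] = missing
    return gaps


# Placeholder record: deposited with an identifier and metadata, but with no
# stable resolution target and no declared community standard.
partial_record = {
    "persistent_identifier": "doi:10.0000/placeholder",
    "metadata": {"title": "Example dataset"},
    "access_conditions": "open",
}
gaps = fair_gaps(partial_record)
```

    A check like this only confirms that the fields are present. Whether the metadata is good, or the standard appropriate, remains a judgement for the repository and its stewards.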

    Where shared vocabulary fits

    The terms in this domain are used loosely in funder mandates and policies: “trusted”, “appropriate”, and “long-term” all mean different things to different bodies, and “generalist” versus “domain” is often left implicit. A shared, federated vocabulary that defines these precisely, pointing to CoreTrustSeal for the trust framework and to the EOSC for the federation model, is what lets a data-sharing requirement be stated unambiguously and checked. Supplying that definitional layer is the role the CASRAI Dictionary is designed to play.

    What to do now

    For researchers: deposit in the appropriate domain repository where one exists, otherwise in a trusted generalist repository (ideally one that is CoreTrustSeal-certified), and never rely on a personal or project server for the long term. For institutions: invest in data stewards, not just storage. For funders and standards work: give “trusted repository” operational meaning through certification and shared vocabulary, and support the federations that make trustworthy services findable.

  • Crediting data stewards and curators: recognising RDM professionals

    Behind every well-managed research dataset there is usually a person whose name does not appear on the paper. They are the ones who organised the data so it made sense, wrote the documentation that explains what each variable means, checked it for errors, chose appropriate formats, ensured it was deposited under the right licence, and made it findable and reusable. This is the work of data stewards and curators — demanding, skilled professional labour that turns a heap of files into an asset that can be trusted and reused. Yet because it does not fit the traditional shape of authorship, it is frequently invisible in the scholarly record. This article makes the case for recognising it properly, drawing on the CRediT-extensions domain of the CASRAI Dictionary.

    The work behind FAIR data

    The aspiration that research data should be FAIR — Findable, Accessible, Interoperable and Reusable — is now widely shared, but it is easy to forget that FAIR data is not a natural state. Data does not become findable, well-documented and reusable on its own; someone has to make it so. Achieving each FAIR principle is real work: findability requires good metadata and persistent identifiers; interoperability requires standard formats and vocabularies; reusability requires thorough documentation, clear licensing and quality checking. This is precisely the work data stewards and curators do. They are, in effect, the people who deliver FAIR in practice, translating an admirable principle into actual datasets that other researchers can find and use. Recognising their contribution is therefore not a courtesy; it is acknowledging the people who make one of open science’s central goals achievable at all.

    The recognition gap

    The difficulty is that the reward systems of research were built around a narrower idea of contribution. Recognition has long been anchored in authorship of articles and the metrics derived from them, and someone whose contribution is curating the data rather than writing the paper can find there is no obvious place for them. They may have spent months making a dataset usable, only to be absent from the byline and, at most, thanked vaguely in an acknowledgement. This invisibility has consequences beyond unfairness. It makes data-management careers harder to sustain, because contribution that cannot be pointed to cannot easily support promotion; and it weakens the incentive to do the work well, because diligent curation goes unrewarded while the data that depends on it is taken for granted. A research system that wants FAIR data but does not recognise the people who produce it works against its own aims.

    The CRediT Data curation role

    One of the most direct ways to close this gap already exists within the standard vocabulary of contribution. The CRediT taxonomy includes a role that names this work explicitly: Data curation, defined as management activities to annotate (produce metadata), scrub data and maintain research data — including the software code where needed to interpret the data itself — for initial use and later reuse. That definition is almost a job description for a data steward. By assigning the Data curation role, a contributorship statement records the steward’s or curator’s work in the same structured form used for every other contributor, in the same place readers and evaluators look. The work appears in the formal record as a recognised contribution rather than disappearing into a line of thanks. The broader question of how contribution taxonomies are being adapted and extended for roles like these is the concern of the CRediT-extensions domain, and the principles of who counts as a contributor connect closely to authorship more generally.

    Beyond a single role

    It is worth being honest that a single role does not capture everything a data professional does. Their contribution often spans several activities, and a fair statement may reflect more than one:

    • Data curation for the core work of annotating, cleaning and maintaining the data.
    • Methodology where they helped design how data would be captured and structured.
    • Software where they built tools or scripts to process or document the data.
    • Validation where they verified the integrity and quality of the data and its outputs.

    The point is not to inflate credit but to describe contribution accurately. Data professionals are not a single undifferentiated category; using the appropriate roles, and more than one where warranted, gives a truthful picture of skilled, multifaceted work — which is what honest recognition requires.
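
    A contributorship statement that records several roles for one person can be sketched as structured data checked against the CRediT role names. The 14 role names below are those of the CRediT taxonomy; the contributor_statement function, the output format, and the contributor names are hypothetical and only illustrate the idea.

```python
# The 14 contributor roles of the CRediT taxonomy.
CREDIT_ROLES = frozenset({
    "Conceptualization", "Data curation", "Formal analysis",
    "Funding acquisition", "Investigation", "Methodology",
    "Project administration", "Resources", "Software", "Supervision",
    "Validation", "Visualization",
    "Writing – original draft", "Writing – review & editing",
})


def contributor_statement(contributions: dict[str, list[str]]) -> str:
    """Render 'Name: Role, Role; Name: Role', rejecting labels outside CRediT."""
    parts = []
    for name, roles in contributions.items():
        unknown = sorted(set(roles) - CREDIT_ROLES)
        if unknown:
            raise ValueError(f"{name}: not CRediT roles: {unknown}")
        parts.append(f"{name}: {', '.join(sorted(set(roles)))}")
    return "; ".join(parts)


# Hypothetical contributors: a data steward credited with more than one role.
statement = contributor_statement({
    "A. Steward": ["Validation", "Data curation", "Software"],
    "B. Author": ["Conceptualization", "Writing – original draft"],
})
```

    Rejecting labels outside the taxonomy is deliberate: a role that is spelled or named differently in each system cannot be credited consistently across them.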

    The professionalisation of research data management

    Recognition in individual outputs is part of a larger development: the professionalisation of research data management. Data stewardship is increasingly understood as a profession with its own expertise, training, standards and career structures, rather than a task done in spare moments by whoever is available. Dedicated data-steward and curator roles are appearing in institutions; training and competency frameworks for data professionals are maturing; and the field is acquiring the identity and standing that mark an established profession. This matters because recognition operates at two levels that reinforce each other. Crediting contributions in outputs makes individual work visible; building data management into a recognised profession makes it a viable career. Visible contributions strengthen the case for professional careers, and professional careers ensure there are skilled people to make the contributions. FAIR data depends on both being in place.

    A consistent vocabulary for data work

    For the contributions of data stewards and curators to be recognised consistently — across institutions, repositories, publishers and reporting systems — the way that work is described must mean the same thing everywhere. A Data curation role recorded in one system must be understood identically in another. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the professional work of curating and stewarding data is understood and credited the same way wherever it appears. The recognition of data professionals is also a concern of research administration, where contributions, careers and the systems that record them come together. FAIR data is one of open science’s great ambitions; recognising the people who make data FAIR — in the record and in their careers — is how that ambition is sustained.