Tag: research data management

  • Electronic lab notebooks and structured record-keeping across the research lifecycle

    When we picture the scholarly record, we tend to think of its end products: the published paper, the deposited dataset, the citation. But each of those is the visible tip of a much larger body of work — the active, day-to-day conduct of research, where experiments are designed and run, samples processed, instruments operated and observations recorded. For generations this working phase was captured, if at all, in the paper laboratory notebook: a bound book on a bench, legible only to its author, locked in a drawer, and disconnected from everything else. An immense amount of crucial information about how research is actually done remained invisible to the wider record. The electronic lab notebook and the structured record-keeping practices around it are changing that. This article looks at how, drawing on the research-lifecycle domain of the CASRAI Dictionary.

    What an electronic lab notebook is

    An electronic lab notebook, or ELN, is software that replaces the paper notebook as the place where researchers record their day-to-day work: experiments, protocols, observations, results and the reasoning behind decisions. At its simplest, an ELN offers obvious practical advantages over paper — it is searchable, backed up, shareable, and resistant to the coffee stains and illegible handwriting that have plagued laboratory science forever. But its deeper significance is that it makes the working record digital and therefore connectable. A paper notebook is an island; an electronic one can be linked to the protocols it follows, the instruments and samples it references, the data files it produces and the people who did the work. The ELN is the point at which the active phase of research enters the connected world that the rest of the record already inhabits.

    Capturing the active phase as connected metadata

    This is the central idea: the ELN lets the active phase of research be captured as connected metadata rather than disappearing into a drawer. When work is recorded electronically and linked properly, a rich web of relationships can be built around it — this experiment used that protocol; it was performed by these people on that instrument; it consumed these samples and produced these data files; it belongs to this project and contributes to that publication. The working phase stops being a black box between the start of a project and its outputs, and becomes a documented, navigable part of the record. This matters for reproducibility, because others can see exactly how a result was produced; for collaboration, because the record is shared rather than siloed; and for integrity, because the chain from question to result is visible rather than reconstructed after the fact.
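
    A minimal sketch of what "connected metadata" could look like, in Python. The record layout, field names and identifiers below are hypothetical and not taken from any particular ELN product; the only point is that each relationship named above becomes a link that can be followed and queried.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentRecord:
    """One notebook entry that holds links rather than free text (hypothetical layout)."""
    record_id: str
    project: str
    protocol: str
    instrument: str
    performed_by: list[str] = field(default_factory=list)
    samples_used: list[str] = field(default_factory=list)
    files_produced: list[str] = field(default_factory=list)


def records_touching(records, **links):
    """Return the records whose links match every given value."""
    matches = []
    for record in records:
        ok = True
        for name, wanted in links.items():
            value = getattr(record, name)
            # List-valued links match when the wanted value is one of the members.
            if isinstance(value, list):
                ok = ok and wanted in value
            else:
                ok = ok and value == wanted
        if ok:
            matches.append(record)
    return matches


notebook = [
    ExperimentRecord("exp-001", "proj-A", "protocol-7", "microscope-2",
                     ["alice"], ["sample-11"], ["scan-001.tif"]),
    ExperimentRecord("exp-002", "proj-A", "protocol-9", "microscope-2",
                     ["bob"], ["sample-11", "sample-12"], ["scan-002.tif"]),
]

# Which experiments consumed sample-12 on microscope-2?
hits = records_touching(notebook, instrument="microscope-2", samples_used="sample-12")
print([r.record_id for r in hits])
```

    A paper notebook cannot answer a question like the one in the last lines without someone reading every page; a linked record answers it with a lookup.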

    FAIR principles for the working record

    The same FAIR principles — Findable, Accessible, Interoperable, Reusable — that govern published data apply, with equal force, to the records created during the active phase. An ELN that captures structured, well-described records makes the working record findable and reusable in a way a paper notebook never could be. The principle is that good data management should not begin at the moment of deposit, when a project ends, but should run through the entire lifecycle, starting at the bench. If records are created in a structured, connected form from the outset, preparing data for deposit becomes a matter of harvesting and tidying what already exists, rather than reconstructing it. Good record-keeping during the active phase is, in this sense, the foundation of good data management overall.

    Provenance: the PROV standard

    A particular strength of structured electronic record-keeping is its capacity to capture provenance, the record of how something came to be: what data was used, what processes acted on it, what agents (people, software, instruments) were involved, and in what order. Provenance is the basis of trust in a result, because it lets others trace exactly how that result was produced and verify each step. The W3C PROV standard provides a formal, machine-readable model for expressing provenance, describing the entities, activities and agents in a process and the relationships between them, so that the chain of how a result was produced can be recorded consistently and understood across systems. An ELN that captures provenance in line with such a standard turns the working record into something far more powerful than a diary: a verifiable account of how knowledge was made.
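
    As a rough illustration, the entity, activity and agent model can be sketched with plain Python tuples. The relation names (used, wasGeneratedBy, wasAssociatedWith) follow PROV terminology, but the file names, the people and the tuple layout are invented for this example, and a real system would use a PROV serialisation rather than this shorthand.

```python
# Each statement is (subject, relation, object). Entities are files,
# activities are processing steps, agents are people or software.
relations = [
    ("denoise", "used", "raw-scan.tif"),
    ("cleaned-scan.tif", "wasGeneratedBy", "denoise"),
    ("denoise", "wasAssociatedWith", "denoise-script-v1"),
    ("plot", "used", "cleaned-scan.tif"),
    ("figure-2.png", "wasGeneratedBy", "plot"),
    ("plot", "wasAssociatedWith", "alice"),
]


def lineage(entity):
    """Walk backwards from an entity to every entity it was derived from."""
    found = []
    for subject, relation, activity in relations:
        if subject == entity and relation == "wasGeneratedBy":
            # Everything the generating activity used is an upstream source.
            for act, rel, source in relations:
                if act == activity and rel == "used":
                    found.append(source)
                    found.extend(lineage(source))
    return found


def agents_for(activity):
    """Return the agents recorded as responsible for an activity."""
    return [obj for subj, rel, obj in relations
            if subj == activity and rel == "wasAssociatedWith"]


print(lineage("figure-2.png"))
print(agents_for("plot"))
```

    Tracing a figure back to the raw scan it came from is exactly the kind of verification that provenance is meant to support.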

    Identifying the work itself: activity identifiers

    If the active phase is to be connected to the rest of the research landscape, the work itself needs to be identifiable. Persistent identifiers have transformed how we refer to outputs and people; the same logic is now being applied to research activities. RAiD (the Research Activity Identifier) is a persistent identifier for research projects and activities, providing a stable handle for the work itself — not just its eventual outputs. With an activity identifier, the records captured in an ELN, the data produced, the people involved and the resulting publications can all be tied to a single, persistent identity for the project. The whole arc of a piece of research — from the work as it happens to the products it yields — can then be traced as a connected whole rather than a set of disconnected fragments.

    A consistent vocabulary across the lifecycle

    For records created at the bench to connect with everything downstream (data repositories, CRIS platforms, publications), the elements they contain must mean the same thing everywhere: a protocol, a sample, an instrument or an activity has to denote the same thing in every system. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the record captured in an electronic lab notebook is understood identically wherever it flows. And because the work recorded there (investigation, data curation, methodology) is genuine contribution, it can be described in the same framework used for every output, the CRediT taxonomy and its full set of contribution roles. The electronic lab notebook brings the most hands-on phase of research into the connected record; structured record-keeping, provenance and activity identifiers let that phase take its rightful place in the story of how knowledge is made.

  • Crediting data stewards and curators: recognising RDM professionals

    Behind every well-managed research dataset there is usually a person whose name does not appear on the paper. That person organised the data so it made sense, wrote the documentation that explains what each variable means, checked it for errors, chose appropriate formats, ensured it was deposited under the right licence, and made it findable and reusable. This is the work of data stewards and curators: demanding, skilled professional labour that turns a heap of files into an asset that can be trusted and reused. Yet because it does not fit the traditional shape of authorship, it is frequently invisible in the scholarly record. This article makes the case for recognising it properly, drawing on the CRediT-extensions domain of the CASRAI Dictionary.

    The work behind FAIR data

    The aspiration that research data should be FAIR — Findable, Accessible, Interoperable and Reusable — is now widely shared, but it is easy to forget that FAIR data is not a natural state. Data does not become findable, well-documented and reusable on its own; someone has to make it so. Achieving each FAIR principle is real work: findability requires good metadata and persistent identifiers; interoperability requires standard formats and vocabularies; reusability requires thorough documentation, clear licensing and quality checking. This is precisely the work data stewards and curators do. They are, in effect, the people who deliver FAIR in practice, translating an admirable principle into actual datasets that other researchers can find and use. Recognising their contribution is therefore not a courtesy; it is acknowledging the people who make one of open science’s central goals achievable at all.

    The recognition gap

    The difficulty is that the reward systems of research were built around a narrower idea of contribution. Recognition has long been anchored in authorship of articles and the metrics derived from them, and someone whose contribution is curating the data rather than writing the paper can find there is no obvious place for them. They may have spent months making a dataset usable, only to be absent from the byline and, at most, thanked vaguely in an acknowledgement. This invisibility has consequences beyond unfairness. It makes data-management careers harder to sustain, because contribution that cannot be pointed to cannot easily support promotion; and it weakens the incentive to do the work well, because diligent curation goes unrewarded while the data that depends on it is taken for granted. A research system that wants FAIR data but does not recognise the people who produce it works against its own aims.

    The CRediT Data curation role

    One of the most direct ways to close this gap already exists within the standard vocabulary of contribution. The CRediT taxonomy includes a role that names this work explicitly: Data curation, defined as management activities to annotate (produce metadata), scrub data and maintain research data — including the software code where needed to interpret the data itself — for initial use and later reuse. That definition is almost a job description for a data steward. By assigning the Data curation role, a contributorship statement records the steward’s or curator’s work in the same structured form used for every other contributor, in the same place readers and evaluators look. The work appears in the formal record as a recognised contribution rather than disappearing into a line of thanks. The broader question of how contribution taxonomies are being adapted and extended for roles like these is the concern of the CRediT-extensions domain, and the principles of who counts as a contributor connect closely to authorship more generally.

    Beyond a single role

    It is worth being honest that a single role does not capture everything a data professional does. Their contribution often spans several activities, and a fair statement may reflect more than one:

    • Data curation for the core work of annotating, cleaning and maintaining the data.
    • Methodology where they helped design how data would be captured and structured.
    • Software where they built tools or scripts to process or document the data.
    • Validation where they verified the integrity and quality of the data and its outputs.

    The point is not to inflate credit but to describe contribution accurately. Data professionals are not a single undifferentiated category; using the appropriate roles, and more than one where warranted, gives a truthful picture of skilled, multifaceted work — which is what honest recognition requires.
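
    A contributorship statement that assigns several roles to one person can be sketched as follows. The role names are the fourteen CRediT roles; the contributor names, the function and the output format are hypothetical, chosen only to show a steward being credited with more than one role.

```python
CREDIT_ROLES = {
    "Conceptualization", "Data curation", "Formal analysis",
    "Funding acquisition", "Investigation", "Methodology",
    "Project administration", "Resources", "Software", "Supervision",
    "Validation", "Visualization", "Writing – original draft",
    "Writing – review & editing",
}


def contributor_statement(contributions):
    """Render {name: [roles]} as one line per contributor, rejecting unknown roles."""
    lines = []
    for name, roles in contributions.items():
        unknown = [role for role in roles if role not in CREDIT_ROLES]
        if unknown:
            raise ValueError(f"not CRediT roles: {unknown}")
        lines.append(f"{name}: {', '.join(sorted(roles))}")
    return "\n".join(lines)


statement = contributor_statement({
    "A. Researcher": ["Investigation", "Writing – original draft"],
    "B. Steward": ["Data curation", "Validation", "Software"],
})
print(statement)
```

    Validating against the fixed role list is the design choice that matters here: a free-text "thanks to B. Steward for help with the data" cannot be checked or aggregated, whereas a named role can.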

    The professionalisation of research data management

    Recognition in individual outputs is part of a larger development: the professionalisation of research data management. Data stewardship is increasingly understood as a profession with its own expertise, training, standards and career structures, rather than a task done in spare moments by whoever is available. Dedicated data-steward and curator roles are appearing in institutions; training and competency frameworks for data professionals are maturing; and the field is acquiring the identity and standing that mark an established profession. This matters because recognition operates at two levels that reinforce each other. Crediting contributions in outputs makes individual work visible; building data management into a recognised profession makes it a viable career. Visible contributions strengthen the case for professional careers, and professional careers ensure there are skilled people to make the contributions. FAIR data depends on both being in place.

    A consistent vocabulary for data work

    For the contributions of data stewards and curators to be recognised consistently — across institutions, repositories, publishers and reporting systems — the way that work is described must mean the same thing everywhere. A Data curation role recorded in one system must be understood identically in another. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the professional work of curating and stewarding data is understood and credited the same way wherever it appears. The recognition of data professionals is also a concern of research administration, where contributions, careers and the systems that record them come together. FAIR data is one of open science’s great ambitions; recognising the people who make data FAIR — in the record and in their careers — is how that ambition is sustained.

  • Machine-actionable data management plans: the maDMP comes of age

    The data management plan has a reputation problem. For most of its existence it has been a document written under deadline pressure to satisfy a funder requirement, deposited as a PDF, and then never opened again. It describes intentions that, by the end of a project, may bear little resemblance to what actually happened to the data. The machine-actionable DMP is the response to that failure mode, and after some years of standards work it has come of age. This article explains what it is and why it matters, drawing on the machine-actionable DMP domain of the CASRAI Dictionary.

    From document to data object

    A data management plan (DMP) is a description of the data-management practices to be followed during and after a research project: what data will be produced, how they will be stored and documented, under what licence and access conditions they will be shared, and how long they will be kept. A machine-actionable DMP (maDMP) is the same content expressed as structured data that research systems can exchange, validate, ingest, and update automatically, rather than as prose only a human can read.

    The distinction is not cosmetic. A prose DMP states that data will be deposited in a trusted repository; a maDMP carries that as a structured assertion that a repository system can read, act on, and later check against what was actually deposited. The DMP stops being a one-time document and becomes a node in the research-information graph, connected to the project, the outputs, the funder, and the people.

    The standard that made it possible: the RDA Common Standard

    Structured exchange requires an agreed structure, and that is the contribution of the RDA DMP Common Standard — the application profile developed by the Research Data Alliance to represent maDMP content in a common, system-neutral form. It defines the entities a DMP describes and the relationships between them, so that a DMP created in one tool means the same thing when read by another.

    The standard’s design encodes a useful distinction the prose form blurs: between an anticipated dataset — a dataset the DMP says will be produced — and a realised dataset, one that has actually been produced and, typically, deposited. A maDMP can carry both, which is precisely what lets a system at closeout check whether the datasets the plan anticipated were in fact realised and deposited. Around these sit the structured fields that prose tends to leave vague: the retention period, the licence assertion, the access control policy, the storage location, and a data volume estimate for storage planning.
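
    The closeout check described above can be sketched in a few lines of Python. The JSON layout here is a deliberately simplified, hypothetical one (it is not the Common Standard's actual schema), and the identifier is a placeholder; the sketch only shows how a structured status field makes the anticipated-versus-realised comparison mechanical.

```python
import json

# Simplified, hypothetical plan; field names are illustrative only.
madmp_json = """
{
  "dmp_id": "doi:10.0000/example-dmp",
  "datasets": [
    {"title": "Survey responses", "status": "realised",
     "licence": "CC-BY-4.0", "repository": "Example Repository"},
    {"title": "Interview transcripts", "status": "anticipated",
     "licence": "CC-BY-4.0", "repository": null}
  ]
}
"""


def unrealised(plan):
    """List the datasets the plan anticipated but that have not been realised."""
    return [d["title"] for d in plan["datasets"] if d["status"] != "realised"]


plan = json.loads(madmp_json)
print(unrealised(plan))
```

    With a prose DMP the same question means rereading the document and comparing it by hand against the repository.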

    The DMP ID: giving the plan an identity

    For a DMP to be referenced across systems, it needs an identity, and that is the role of the DMP ID — a persistent identifier for a specific data management plan, typically a DOI minted by DataCite through tools such as the DMPTool, the DCC’s DMPonline, or ARGOS. With a DMP ID, the plan can be cited like any other research object: a funder can refer to it, a CRIS can link to it, an output can point back to the plan that anticipated it, and the connections become part of the persistent-identifier graph alongside ORCID, ROR, and the grant ID. The DMP ID is what turns the DMP from a loose attachment into a first-class, addressable entity in the persistent-identifier ecosystem.

    The living DMP

    The deepest change the maDMP enables is conceptual: the move from the frozen DMP to the living DMP — a plan updated throughout the project lifecycle rather than fixed at award. A frozen DMP is a prediction made at the least-informed moment of a project, before any data exist. A living DMP is a record that tracks reality: as anticipated datasets become realised, as storage decisions change, as access conditions are settled, the plan is updated, and a DMP version captures each snapshot.

    The frozen DMP answers the question “what did the applicant promise at award?” The living maDMP answers a far more useful question: “what is actually happening to this project’s data, right now?” Only the second is worth the effort of maintaining.

    This is where maDMP exchange earns its keep. When the DMP is structured and identified, a change made in one system can propagate — from a DMP tool to a CRIS, from the CRIS to a repository — so that the plan stays current without re-keying. A scheduled DMP review event becomes a checkpoint against live data rather than a re-reading of a stale document, and a DMP completeness score can be computed automatically against the funder’s required elements.
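
    One simple way to compute such a completeness score is as the fraction of required elements that are actually filled in. The sketch below assumes a flat plan dictionary and a made-up list of funder requirements based on the structured fields mentioned earlier; real funders define their own required elements and may weight them differently.

```python
# Hypothetical funder requirements, named after the structured fields of a maDMP.
REQUIRED_ELEMENTS = [
    "retention_period",
    "licence",
    "access_policy",
    "storage_location",
    "volume_estimate",
]


def completeness_score(plan, required=REQUIRED_ELEMENTS):
    """Return the fraction of required elements that are present and non-empty."""
    filled = [name for name in required if plan.get(name) not in (None, "", [])]
    return len(filled) / len(required)


def missing_elements(plan, required=REQUIRED_ELEMENTS):
    """Return the required elements that are absent or empty."""
    return [name for name in required if plan.get(name) in (None, "", [])]


plan = {
    "retention_period": "10 years",
    "licence": "CC-BY-4.0",
    "access_policy": "",          # present but left blank
    "storage_location": "institutional store",
    # volume_estimate not supplied at all
}

score = completeness_score(plan)
print(f"{score:.0%} complete, missing: {missing_elements(plan)}")
```

    Because the score is recomputed from the live plan, it changes as soon as a field is filled in, which is what makes a scheduled review a checkpoint rather than a reread.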

    Why funders and institutions want this

    The maDMP is not an end in itself; it is wanted because it makes obligations checkable. A funder that requires data to be deposited in a trusted repository under an open licence can, with structured maDMPs, verify that the realised datasets meet the commitment, rather than trusting a final-report paragraph. An institution can monitor data-management compliance across its whole portfolio as a query over structured plans. And the researcher, crucially, benefits too: a living maDMP linked to the project’s outputs means the closeout data-management report is largely assembled already, not reconstructed from memory. This is the same dividend that structured grant and disclosure data pay throughout research administration.

    Where shared vocabulary fits

    The RDA Common Standard supplies the structure, the shape of a maDMP. It does not, on its own, fix the controlled values that populate it: the list of access categories, the licence vocabulary, the dataset-status terms. Two systems can both emit valid Common Standard maDMPs and still disagree on what “restricted access” or “realised” means. That definitional gap, below the structural model, is exactly what a shared, federated vocabulary fills, while deferring to the RDA for the standard itself and to DataCite for the DMP ID infrastructure. Supplying that vocabulary is the role the CASRAI Dictionary is built for.

    What to do now

    For researchers and data stewards: treat the DMP as a living, structured object with a DMP ID, updated as anticipated datasets become realised. For funders: ask for maDMPs against the RDA Common Standard and verify realised against anticipated at closeout. For standards work: pair the structural standard with shared value vocabularies so that maDMPs from different tools genuinely interoperate.

  • Data lifecycle management: the DCC Curation Lifecycle Model

    Research data is often treated as if it has only two moments that matter: when it is collected and when it is published. Everything in between is left to chance. Yet data that is well collected but poorly managed can become unusable within a few years: file formats fall out of support, the meaning of variables is forgotten, copies multiply and diverge, and the person who understood it moves on. Treating data as a thing to be looked after across its whole existence, rather than captured once and forgotten, is the essence of data lifecycle management. The most influential map of that lifecycle is the Digital Curation Centre’s Curation Lifecycle Model, which provides a structured way to think about the journey data takes — a journey at the heart of the research-lifecycle domain of the CASRAI Dictionary.

    Why curation is continuous

    The central insight of the lifecycle view is that curation is an active, continuous process, not a one-off task performed at the end. It is tempting to imagine that data can be generated freely and tidied up later. In practice, the decisions that determine whether data will survive and remain usable are made throughout: how it is structured and documented as it is created, how it is stored while in use, what is kept and what is discarded, and how it is prepared for the long term. Leaving all of this to the end means leaving it too late — documentation that was obvious at the time is forgotten, and choices that should have been deliberate are made by default. The Digital Curation Centre, a UK centre of expertise, developed its model precisely to make these activities visible and deliberate across the whole life of the data.

    The shape of the model

    The Curation Lifecycle Model is usually drawn as a series of concentric rings around the data at the centre. At its core sit the digital objects and databases being curated. Surrounding them are full lifecycle actions — activities that apply throughout, not at a single stage. These include description and representation information (the metadata and documentation that make data understandable), preservation planning, community watch and participation (keeping up with standards and tools), and the overarching work of curating and preserving. Around these run the sequential actions that the data passes through over time. The genius of the model is in holding both ideas at once: some curation work happens at particular moments in sequence, while other work — above all documentation and preservation planning — must be sustained continuously throughout.

    The sequence of actions

    The sequential part of the model traces data through its life:

    • Conceptualise. Plan how data will be created and managed before any of it exists: the planning that a data management plan captures.
    • Create or receive. Generate the data, or take it in, with the metadata and documentation it needs from the outset.
    • Appraise and select. Decide which data should be kept for the long term, judged against guidance and policy. Not everything need be preserved forever; deciding deliberately is itself curation.
    • Ingest. Transfer the selected data into a repository or archive that will look after it.
    • Preservation action. Take the steps that keep data usable over time — format migration, integrity checks and the rest.
    • Store. Keep the data securely and reliably.
    • Access, use and reuse. Make the data available to those entitled to it, for the purposes that justify keeping it.
    • Transform. Create new data from the original, which then re-enters the lifecycle in its own right.

    The model also includes occasional actions — reappraisal, migration and, where appropriate, disposal of data that should not be retained — acknowledging that curation involves honest decisions about what not to keep as well as what to preserve.
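
    The integrity checks mentioned under preservation action are commonly implemented as fixity checks: a checksum is recorded when data is ingested and recomputed later to detect silent change. The lifecycle model does not prescribe a method, so the choice of SHA-256, the function names and the file name below are assumptions made for this sketch.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_checksum


with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "observations.csv"
    data_file.write_bytes(b"id,value\n1,0.5\n")

    recorded = sha256_of(data_file)          # stored alongside the data at ingest
    intact = fixity_ok(data_file, recorded)  # later check, file unchanged

    data_file.write_bytes(b"id,value\n1,0.6\n")   # simulate an unnoticed change
    damaged = not fixity_ok(data_file, recorded)

print(intact, damaged)
```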

    Appraisal: the decision at the centre of curation

    Of all these stages, appraisal and selection deserves particular emphasis, because it is where lifecycle thinking departs most sharply from the instinct to keep everything. Storing data indefinitely is neither free nor harmless: it consumes resources, and a vast undifferentiated mass of poorly described data is hard to use. Appraisal is the disciplined judgement about what has lasting value (what should be preserved because it could be reused or verified, or because it is too costly to reproduce) and what can responsibly be let go. Making that judgement well, against clear policy, is one of the most professional acts in data management, and the lifecycle model puts it where it belongs: a deliberate decision point, not an accident of neglect.

    Preservation in service of reuse

    It is worth being clear about why all this effort is undertaken. The point of preservation is not to lock data away but to keep it usable, because the ultimate purpose of curation is reuse. Data that has been appraised, documented, preserved and made accessible can be verified by others, combined with new data, and built upon in ways its creators never anticipated. This is the payoff that justifies the whole lifecycle: well-curated data is an asset that keeps giving, while neglected data is a sunk cost that decays. The model makes the connection explicit by placing reuse alongside preservation, a reminder that curation serves a purpose beyond mere safekeeping.

    A consistent vocabulary across the lifecycle

    For data to move smoothly through these stages, across the tools, repositories and systems involved, the information describing it must mean the same thing at every step. Metadata created at capture must be understood by the repository that ingests it; reuse depends on description that travels intact. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so the information accompanying data is understood identically wherever it flows. And because curating data is genuine, recognisable contribution, the work can be described using the same framework as any other contribution: the CRediT taxonomy, whose Data curation role names exactly this activity. The lifecycle model shows that good data does not happen by accident; sustained curation, supported by shared description, turns data collected once into data usable for years.

  • Evaluating data management plans: how funders and institutions review DMPs

    Data management plans have become a near-universal requirement. Funders ask for them at the proposal stage, institutions increasingly expect them, and researchers have largely accepted that planning for data is part of designing a project. But requiring a plan and getting a good plan are two very different things. A DMP written hastily to satisfy a requirement, glanced at once and never looked at again, achieves almost nothing — it is a box ticked, not a commitment made. The harder, less-discussed half of the DMP story is evaluation: how plans are actually reviewed, against what criteria, by whom, and with what consequences. As DMPs mature, attention is rightly shifting from whether they exist to whether they are any good. This article examines DMP evaluation, drawing on the machine-actionable DMP domain of the CASRAI Dictionary.

    Why evaluation matters

    The case for taking DMP review seriously is straightforward. If a plan is never assessed, there is little incentive to write a good one, and the requirement degenerates into a formality that consumes effort without improving practice. Evaluation is what gives a DMP teeth: it signals that the plan is expected to be substantive, it provides researchers with feedback they can act on, and it lets funders identify proposals where the data-handling arrangements are inadequate or unrealistic. A reviewed DMP is a commitment someone has engaged with; an unreviewed DMP is a wish.

    Rubrics and review criteria

    To review plans fairly and consistently, reviewers need criteria, and this has driven the development of DMP rubrics — structured frameworks that lay out what a good plan should address and how to judge it. A rubric breaks the assessment down into components and gives reviewers a consistent basis for judging each one, so that plans are evaluated against the same expectations rather than according to each reviewer’s personal sense of what matters. Typical dimensions a rubric covers include:

    • Data description. Is it clear what data will be produced or used, in what formats and volumes?
    • Documentation and metadata. Will the data be documented well enough to be understood and reused?
    • Storage and security. Are arrangements for storing and protecting the data, including any sensitive data, adequate?
    • Preservation and sharing. Where will the data be deposited, under what access conditions and licence, and for how long?
    • Ethical and legal compliance. Are consent, privacy and legal obligations properly addressed?
    • Roles and resources. Is it clear who is responsible, and are the resources to do this realistic?

    One prominent example is the rubric produced by the DART Project (Data management plans as A Research Tool), developed to help institutions and reviewers assess DMPs systematically and consistently. Tools and rubrics of this kind matter because they turn “is this a good plan?”, a vague and subjective question, into a structured assessment that different reviewers can apply in comparable ways.
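
    To make the idea of a structured assessment concrete, here is a small Python sketch that scores a plan against the six dimensions listed above. The three-point scale and the function are hypothetical and are not drawn from any published rubric, which will define its own levels and criteria.

```python
RUBRIC_DIMENSIONS = [
    "Data description",
    "Documentation and metadata",
    "Storage and security",
    "Preservation and sharing",
    "Ethical and legal compliance",
    "Roles and resources",
]

# Hypothetical scale: 0 = not addressed, 1 = partially, 2 = fully addressed.
MAX_SCORE = 2


def assess(scores):
    """Summarise one reviewer's scores; every dimension must be scored."""
    missing = [d for d in RUBRIC_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    return {
        "total": sum(scores[d] for d in RUBRIC_DIMENSIONS),
        "maximum": MAX_SCORE * len(RUBRIC_DIMENSIONS),
        "needs_attention": [d for d in RUBRIC_DIMENSIONS if scores[d] < MAX_SCORE],
    }


reviewer_scores = {
    "Data description": 2,
    "Documentation and metadata": 1,
    "Storage and security": 2,
    "Preservation and sharing": 0,
    "Ethical and legal compliance": 2,
    "Roles and resources": 2,
}

result = assess(reviewer_scores)
print(result)
```

    Forcing every dimension to be scored is what keeps two reviewers comparable: neither can skip the part of the plan they find less interesting.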

    Funder assessment in practice

    Funders approach DMP assessment in different ways and at different points. Some review the plan as part of the proposal, treating the quality of data-handling arrangements as one factor in deciding what to fund. Others emphasise the DMP as a project deliverable, expecting it to be developed and updated as the project proceeds. In either case, the trend is towards taking the plan seriously as something to be engaged with, not merely collected. There is a balance to strike: assessment should be rigorous enough to improve practice but proportionate enough not to impose a heavy burden. A purely bureaucratic review risks producing better-written plans but no better-managed data; the aim is to improve what actually happens to the data, not just the prose describing it.

    Feedback loops

    Perhaps the most valuable, and most often neglected, aspect of DMP evaluation is the feedback loop. Assessment is most useful when it is not merely a gate — pass or fail — but a source of guidance that helps researchers improve their plans and their practice. Feedback can flow in several directions:

    • To the researcher, pointing out weaknesses and suggesting improvements, ideally early enough to act on.
    • Into the project, where a plan reviewed at the start can be revisited and updated as the work develops and the data takes shape.
    • Back to support services, where patterns across many plans reveal where researchers commonly struggle, so that training and support can be targeted.

    Feedback is what turns evaluation from a judgement into a constructive tool. A plan that comes back with specific, actionable comments helps the researcher do better; a plan that simply passes or fails teaches nothing.

    Machine-actionable checks

    The move towards machine-actionable DMPs (maDMPs) opens a powerful possibility for evaluation: automating the parts of review that can be automated. When a plan is expressed as structured, machine-readable data rather than free prose, certain checks no longer require a human. A system can verify whether a repository has been specified, whether a licence has been chosen, whether an identifier has been minted, or whether commitments are consistent with funder policy. This does not replace expert human judgement — assessing whether the chosen approach suits the research still requires understanding — but it can handle the routine, checkable elements automatically, freeing reviewers to focus on the judgements that genuinely need them. Machine-actionable checks can also run continuously, so that a living plan is monitored against its commitments throughout a project rather than assessed only once.
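
    The routine checks named above might look like this in Python. The plan layout, the field names and the list of permitted licences are all hypothetical; an actual implementation would read a maDMP in whatever schema the tool exports and take the policy from the funder.

```python
def check_plan(plan, allowed_licences):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if not plan.get("dmp_id"):
        problems.append("no identifier minted for the plan")
    for dataset in plan.get("datasets", []):
        title = dataset.get("title", "untitled dataset")
        if not dataset.get("repository"):
            problems.append(f"{title}: no repository specified")
        licence = dataset.get("licence")
        if not licence:
            problems.append(f"{title}: no licence chosen")
        elif licence not in allowed_licences:
            problems.append(f"{title}: licence {licence} not permitted by funder policy")
    return problems


plan = {
    "dmp_id": "",
    "datasets": [
        {"title": "Survey responses", "repository": "Example Repository",
         "licence": "CC-BY-4.0"},
        {"title": "Interview transcripts", "repository": None,
         "licence": "All rights reserved"},
    ],
}

problems = check_plan(plan, allowed_licences={"CC-BY-4.0", "CC0-1.0"})
for problem in problems:
    print(problem)
```

    Returning a list of specific problems, rather than a single pass or fail, keeps the automated check useful as feedback and leaves the judgement about suitability to a human reviewer.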

    A shared vocabulary for review

    For DMP evaluation to work consistently, across funders, institutions and the tools that support planning, the elements being reviewed and the criteria applied must mean the same thing everywhere. A plan written against one set of expectations and reviewed against another, or described in terms a reviewing system cannot interpret, defeats the purpose. That consistency is what the CASRAI Dictionary supports: a shared vocabulary so that the components of a data management plan are understood identically by those who write them and those who review them, supporting sound research administration. And because reviewing and supporting data management is genuine contribution, the work can be described in the same framework used for every other contribution: the CRediT taxonomy and its full set of contribution roles. A DMP is only as valuable as the seriousness with which it is reviewed; good evaluation is what turns the plan from a promise into a practice.