Category: Guides & Explainers

Practical how-to guides, templates, checklists, and career pathways for research administrators, authors, and institutional teams.

  • Variance in Statistics: Definition and Formula

    Variance is a measure of how spread out a set of values is, defined as the average of the squared deviations of each value from the mean. A large variance means the data points are widely dispersed; a small variance means they cluster tightly around the mean. Because the deviations are squared, variance is always non-negative and is expressed in squared units of the original measurement.

    The definition of variance

    To calculate variance, you first find the mean of the data, then subtract the mean from each value to get the deviations. Squaring each deviation removes the sign (so positive and negative deviations do not cancel) and gives greater weight to values far from the mean. The average of these squared deviations is the variance.

    Variance is the foundation of many statistical methods, including the analysis of variance (ANOVA), regression diagnostics and the construction of confidence intervals. Reporting it transparently supports the goals set out in our reproducibility coverage.

    Population variance versus sample variance

    The formula depends on whether your data are the entire population or a sample drawn from it. For a population, you divide the sum of squared deviations by the number of values, N. For a sample of n values, you divide by n − 1 instead of n. This adjustment, known as Bessel’s correction, produces an unbiased estimate of the population variance: deviations measured from the sample mean are, on average, slightly smaller than deviations from the true population mean, so dividing by n would underestimate the spread.

    Quantity | Symbol | Divisor
    Population variance | σ² | N
    Sample variance | s² | n − 1
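
    Written out in symbols, the two divisors correspond to the formulas below. The notation is the conventional one and is an addition here rather than something stated in the text: x_i are the individual values, μ is the population mean and x̄ is the sample mean.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2
\qquad\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
```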

    A worked conceptual example

    Suppose five replicate measurements give 4, 8, 6, 5 and 2. The mean is (4 + 8 + 6 + 5 + 2) / 5 = 5. The deviations from the mean are −1, 3, 1, 0 and −3. Squaring these gives 1, 9, 1, 0 and 9, which sum to 20. Treating the five values as a population, the variance is 20 / 5 = 4. Treating them as a sample, the variance is 20 / 4 = 5. The sample figure is slightly larger, reflecting Bessel’s correction.
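
    The same arithmetic can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and the final lines cross-check the hand calculation against the `statistics` module.

```python
from statistics import mean, pvariance, variance

values = [4, 8, 6, 5, 2]

m = mean(values)                               # mean of the five values
squared_devs = [(x - m) ** 2 for x in values]  # squared deviations from the mean
sum_sq = sum(squared_devs)                     # sum of squared deviations

population_var = sum_sq / len(values)          # divide by N
sample_var = sum_sq / (len(values) - 1)        # divide by n - 1 (Bessel's correction)

# Cross-check against the standard library implementations.
assert population_var == pvariance(values)
assert sample_var == variance(values)

print(population_var, sample_var)              # 4.0 5.0
```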

    Variance and the standard deviation

    Variance and the standard deviation describe the same property of spread, but in different units. The standard deviation is simply the square root of the variance, which returns the measure to the original units of the data. In our worked example the population standard deviation is √4 = 2. Because the standard deviation is easier to interpret alongside the mean, it is often reported in papers; see our companion piece on the standard deviation for detail. Variance, however, has convenient mathematical properties, which is why it underlies so many statistical procedures.

    Interpreting variance correctly

    Because variance is in squared units, its absolute size is hard to interpret in isolation. A variance of 4 cm² is meaningful only relative to the scale of the measurement. Variance is also sensitive to outliers: squaring magnifies the effect of extreme values, so a single anomalous point can inflate the variance substantially. Always inspect your data distribution before reporting variance, and define the term consistently in your methods. The CASRAI dictionary and our author guidance encourage precise, reproducible statistical reporting.
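
    To see how strongly one extreme value can move the figure, the sketch below adds a single anomalous reading to the five values from the worked example. The reading of 30 is a made-up value chosen only for illustration.

```python
from statistics import variance

measurements = [4, 8, 6, 5, 2]        # values from the worked example
with_outlier = measurements + [30]    # hypothetical anomalous reading

baseline = variance(measurements)     # sample variance without the outlier
inflated = variance(with_outlier)     # sample variance with the outlier
inflation = inflated / baseline       # how many times larger the variance became

print(baseline, round(inflated, 2), round(inflation, 1))
```

    One extra point raises the sample variance from 5 to roughly 108, which is why inspecting the distribution before reporting matters.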

    Frequently asked questions

    Why does variance use squared rather than absolute deviations?

    Squaring the deviations keeps the measure mathematically tractable and differentiable, which makes it the natural basis for least squares estimation and many other techniques. The absolute deviation is an alternative but lacks these convenient properties.

    When should I divide by n − 1 instead of n?

    Divide by n − 1 whenever your data are a sample used to estimate the variance of a wider population. Divide by N only when your data genuinely represent the entire population of interest.

    Is a high variance bad?

    Not inherently. High variance simply means greater spread. Whether that is good or bad depends on context: high variance in measurement error is undesirable, but natural biological variation may be expected and informative.

  • ANOVA (Analysis of Variance) Explained: Comparing Means Across Groups

    Analysis of variance (ANOVA) is a statistical method that tests whether the means of three or more groups differ by more than would be expected from random variation alone. It does this by comparing the variance between group means against the variance within groups, summarised in a single F-statistic. ANOVA is one of the most widely used inferential tests in experimental research, and reporting it transparently is central to reproducible analysis.

    Why ANOVA instead of multiple t-tests?

    A t-test compares two group means. When you have three or more groups, it is tempting to run a separate t-test for every pair. The problem is the family-wise error rate: each test carries its own chance of a false positive, and those chances accumulate. With three groups there are three pairwise comparisons; at a 5% significance level the probability of at least one false positive rises to roughly 14%, and it climbs further as groups are added. ANOVA solves this by performing a single omnibus test that asks one question: are any of the group means different?
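
    The 14% figure follows from the usual approximation 1 − (1 − α)^m for m tests, treating the tests as independent (pairwise comparisons on shared groups are not strictly independent, so this is an approximation). A short sketch, with a function name chosen here for illustration:

```python
from math import comb

def familywise_error_rate(groups: int, alpha: float = 0.05) -> float:
    """Approximate chance of at least one false positive across all
    pairwise tests, assuming the tests are independent."""
    comparisons = comb(groups, 2)          # number of distinct pairs of groups
    return 1 - (1 - alpha) ** comparisons

for k in (3, 4, 5):
    print(k, comb(k, 2), round(familywise_error_rate(k), 3))
# 3 groups ->  3 comparisons -> 0.143
# 4 groups ->  6 comparisons -> 0.265
# 5 groups -> 10 comparisons -> 0.401
```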

    This control of error is why ANOVA underpins so much of experimental design. For a refresher on what significance thresholds mean in practice, see our explainer on p-values and statistical significance.

    The F-statistic and how it works

    ANOVA partitions the total variability in the data into two components. The between-groups variance reflects how far each group mean sits from the overall (grand) mean. The within-groups variance reflects the natural spread of observations inside each group. The F-statistic is the ratio of these two:

    F = between-groups variance / within-groups variance

    If the groups truly share a common mean, both quantities estimate the same underlying variability and F sits near 1. When real differences exist, the between-groups term grows and F rises. A large F, evaluated against the F-distribution with the appropriate degrees of freedom, yields a small p-value and signals that at least one mean differs.
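
    As a sketch of how the ratio is assembled, the code below computes a one-way F-statistic by hand. It uses the standard mean-square definitions (each sum of squares divided by its degrees of freedom), which the text above refers to as the between-groups and within-groups variance. The three groups are invented data for illustration only.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)                  # number of groups
    n_total = len(all_values)        # total number of observations

    # How far each group mean sits from the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within, k - 1, n_total - k

groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]   # hypothetical measurements
f_stat, df_between, df_within = one_way_f(groups)
print(round(f_stat, 2), df_between, df_within)
```

    Here F is about 19 on 2 and 6 degrees of freedom. Turning that into a p-value requires the F-distribution, which a statistics package would normally supply.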

    One-way versus two-way ANOVA

    The design depends on how many factors you are manipulating.

    Feature | One-way ANOVA | Two-way ANOVA
    Number of factors | One independent variable | Two independent variables
    Example question | Does diet type affect plant growth? | Do diet type and watering frequency affect plant growth?
    Main effects | One | Two (one per factor)
    Interaction | Not assessed | Tests whether factors combine non-additively
    Output | Single F-statistic | F-statistic for each main effect plus interaction

    The key advantage of two-way ANOVA is the interaction effect: it reveals whether the influence of one factor depends on the level of another, something separate analyses would miss.

    Assumptions you must check

    ANOVA rests on three core assumptions. Observations should be independent. The residuals should be approximately normally distributed. And the groups should show roughly equal variances, a property called homogeneity of variance (homoscedasticity). When variances differ markedly, a Welch ANOVA is a robust alternative; when normality fails, a non-parametric Kruskal-Wallis test may be more appropriate. Stating which assumptions were tested, and how, is good practice and supports replication, as we discuss across our reproducibility coverage.

    Post-hoc tests: locating the difference

    A significant ANOVA tells you that some mean differs, but not which one. Post-hoc tests answer that follow-up while still controlling the family-wise error rate. Tukey’s HSD is the standard choice for all pairwise comparisons with equal sample sizes; Bonferroni correction is conservative and simple; Scheffé’s test is flexible for complex contrasts. Crucially, you should not revert to uncorrected t-tests after a significant ANOVA, as that reintroduces the inflated error the test was designed to prevent.

    Equally important, statistical significance does not measure how large a difference is. Always pair ANOVA results with an effect size such as eta-squared, as covered in our companion piece on why effect size matters beyond significance. Authors planning a study should also budget adequate sample size and statistical power so a real effect can actually be detected.
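
    For a one-way design, eta-squared is conventionally the between-groups sum of squares divided by the total sum of squares, that is, the share of total variability associated with group membership. A minimal sketch, again with invented data:

```python
from statistics import mean

def eta_squared(groups):
    """Proportion of total variability attributable to group membership."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_between / ss_total

groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]   # hypothetical measurements
effect = eta_squared(groups)
print(round(effect, 3))                        # 0.864
```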

    Frequently asked questions

    What does a significant ANOVA result actually tell you?

    It tells you that at least one group mean differs from the others by more than chance would explain. It does not identify which groups differ or how large the difference is; you need post-hoc tests and effect sizes to answer those questions.

    Can ANOVA be used for only two groups?

    Yes. With two groups a one-way ANOVA gives results mathematically equivalent to an independent-samples t-test (F equals t squared). ANOVA’s real value appears with three or more groups, where it prevents the error inflation of multiple t-tests.
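
    The equivalence can be checked numerically. The sketch below assumes the pooled-variance (Student’s) form of the independent-samples t-test, which is the version the identity applies to; the two groups are invented data.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Independent-samples t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))

def two_group_f(a, b):
    """One-way ANOVA F-statistic for exactly two groups."""
    grand = mean(a + b)
    ss_between = len(a) * (mean(a) - grand) ** 2 + len(b) * (mean(b) - grand) ** 2
    ss_within = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
    df_between = 1                      # two groups minus one
    df_within = len(a) + len(b) - 2
    return (ss_between / df_between) / (ss_within / df_within)

a = [4, 5, 6, 7]    # hypothetical group A
b = [6, 8, 9, 9]    # hypothetical group B

t = pooled_t(a, b)
f = two_group_f(a, b)
print(round(t ** 2, 4), round(f, 4))    # the two values agree
```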

    What is the difference between a main effect and an interaction?

    A main effect is the overall influence of one factor averaged across the others. An interaction means the effect of one factor changes depending on the level of another. Detecting interactions is the principal reason to use two-way rather than one-way designs.

    How should ANOVA results be reported for reproducibility?

    Report the F-statistic with both degrees of freedom, the p-value, an effect size, the post-hoc method used, and confirmation that assumptions were checked. The CASRAI dictionary and our guidance for authors set out the metadata that makes such results auditable.

  • Research costing and full economic costing: calculating the true cost of research

    Ask what a research project costs and most people will think of the obvious things: the salaries of the researchers, the equipment, the consumables, the travel. These are real costs, but far from the whole story. Behind every project stands an apparatus that makes it possible — the buildings, the heat and light, the libraries and computing, the finance, HR and research-management staff, the depreciation of shared facilities, the cost of the institution’s very existence. These are no less real for being less visible, and if they are not accounted for the institution is quietly subsidising the research from elsewhere. Working out what research genuinely costs, and how much of it a funder will pay, is the unglamorous but essential discipline of research costing. This article explains how the true cost of research is calculated and recovered, drawing on the funding and finance domain of the CASRAI Dictionary.

    Direct and indirect costs

    The foundational distinction in research costing is between direct and indirect costs. Direct costs are those that can be attributed specifically to a project: the salaries of the people working on it, the equipment bought for it, its consumables and its travel. Indirect costs — often called overheads — are the costs of shared infrastructure and services the project relies on but that cannot be tied to it alone: estates, central administration, IT, libraries, finance and HR, and the general running of the institution. The crucial point is that indirect costs are real costs of doing the research, even though they are shared. A project conducted as if only its direct costs mattered would appear far cheaper than it truly is, and the gap would be made up invisibly by the host institution.

    Full economic costing

    The principle that the entire cost of research — direct and indirect — should be identified is captured in the idea of full economic costing (FEC). Its aim is to reveal what research actually costs to undertake, including a fair share of the institution’s overheads and infrastructure, so that decisions about pricing, funding and sustainability rest on an honest basis. Without it, institutions cannot know whether their research is financially sustainable, and may unknowingly run at a loss on activity that appears, on a narrow view, to be funded. Full economic costing does not by itself determine who pays; it establishes the true figure against which questions of payment can be sensibly considered. It is, in effect, the costing equivalent of telling the truth about the bill.

    TRAC: the UK approach

    In the United Kingdom, the method used to arrive at the full economic cost is TRAC — the Transparent Approach to Costing. TRAC is the standard by which UK universities cost their activities, including research, attributing the relevant share of indirect and estates costs to reach a full economic cost figure, from which the rates used in pricing proposals derive. TRAC matters because it provides a consistent, accepted basis for understanding institutional costs, which underpins negotiations with funders about what they will pay. Its existence means the full cost of UK research is not guesswork or special pleading but the output of an agreed methodology — giving institutions and funders a common, defensible starting point.

    Funders rarely pay the full cost

    Here lies the central tension of research costing: knowing the full economic cost does not mean a funder will pay it. Funder cost policies vary widely in how much of the full cost they cover, and many fund research at less than its full economic cost. The shortfall — between what the research costs and what the funder pays — must be met by the institution from its other income. This has profound consequences: institutions must think carefully about the portfolio of research they undertake and how it is sustained; winning a grant is not the same as the research being fully funded; and the apparently arcane details of funder cost policies bear directly on institutional viability. The recovery of indirect costs in particular is a perennial point of negotiation, because indirect costs are precisely what funders are most often reluctant to pay in full.
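
    The arithmetic behind the shortfall is simple enough to sketch. Every figure below is a hypothetical placeholder rather than a real rate or budget, and for simplicity estates costs are folded into the indirect figure.

```python
def funding_shortfall(direct_costs, indirect_costs, funder_contribution):
    """Return (full economic cost, shortfall, recovery rate) for one project."""
    full_economic_cost = direct_costs + indirect_costs
    shortfall = full_economic_cost - funder_contribution       # met by the institution
    recovery_rate = funder_contribution / full_economic_cost   # share of cost recovered
    return full_economic_cost, shortfall, recovery_rate

# Hypothetical project: all amounts are illustrative only.
fec, shortfall, recovery = funding_shortfall(
    direct_costs=300_000,
    indirect_costs=200_000,
    funder_contribution=400_000,
)
print(fec, shortfall, recovery)   # 500000 100000 0.8
```

    In this example the grant is won, yet a fifth of the true cost still has to come from the institution’s other income.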

    International approaches to indirect costs

    Different systems handle indirect costs in characteristically different ways, and the contrasts are illuminating:

    • In the United States, federal research grants typically apply a negotiated indirect cost rate — facilities and administrative (F&A) costs — agreed between an institution and the government and applied to direct costs.
    • In the European Union’s research framework programmes, indirect costs have commonly been handled through a flat-rate approach, reimbursing overheads as a fixed percentage of eligible direct costs rather than calculating them project by project.
    • In the United Kingdom, the TRAC methodology produces full economic cost figures, with funders such as the research councils contributing a defined proportion of that cost.

    Each model attempts to solve the same problem — how to recognise and recover the genuine but shared costs of research — and each strikes a different balance between accuracy, simplicity and the funder’s willingness to pay. The differences matter for institutions working across multiple funders and jurisdictions, where the same project may be costed and reimbursed quite differently depending on whose rules apply.
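
    A rough sketch of how the same project could be reimbursed under the three models follows. The rates used (0.50, 0.25 and 0.80) are placeholders chosen for illustration, not actual funder rates, and the functions simplify real rules: a negotiated rate, for instance, is normally applied to a defined base rather than to all direct costs.

```python
def negotiated_rate_payment(direct_costs, negotiated_rate):
    """Direct costs plus indirect costs at a negotiated rate on direct costs."""
    return direct_costs + direct_costs * negotiated_rate

def flat_rate_payment(eligible_direct_costs, flat_rate):
    """Direct costs plus overheads as a fixed percentage of eligible direct costs."""
    return eligible_direct_costs + eligible_direct_costs * flat_rate

def fec_proportion_payment(full_economic_cost, funder_proportion):
    """A defined proportion of the full economic cost."""
    return full_economic_cost * funder_proportion

# Hypothetical project: 300,000 direct costs, 500,000 full economic cost.
direct, fec = 300_000, 500_000

payments = {
    "negotiated rate (placeholder 0.50)": negotiated_rate_payment(direct, 0.50),
    "flat rate (placeholder 0.25)": flat_rate_payment(direct, 0.25),
    "proportion of FEC (placeholder 0.80)": fec_proportion_payment(fec, 0.80),
}
for model, amount in payments.items():
    print(f"{model}: paid {amount:,.0f}, shortfall {fec - amount:,.0f}")
```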

    Costing as part of research management

    Research costing is one of the less visible but more consequential parts of research administration. Getting it right protects the financial health of institutions and ensures that difficult conversations about funding rest on accurate figures rather than wishful thinking. For costs to be compared, reported and managed across institutions, funders and systems, the categories involved — direct, indirect, estates, full economic cost, recovery rate — must mean the same thing everywhere. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that financial information about research is understood identically wherever it is recorded. And because the work behind every funded project is what the costing supports, the contributions involved can be described within the same framework as any other — the CRediT taxonomy and its contribution roles. Research is never free, and someone always pays for the infrastructure that makes it possible; full economic costing is the discipline of being honest about that fact.

  • Data management plan examples: from narrative DMP to maDMP

    The data management plan has a reputation problem. For many researchers it is a compliance document written in the final week before a grant deadline, accepted by the funder, and never opened again. That is a waste of a genuinely useful artefact, and it is also a missed opportunity, because the same plan can be expressed in a form that systems can act on. This article walks from a conventional narrative DMP to a machine-actionable one, showing what the structure buys you. It builds on the machine-actionable DMPs domain and connects to the workflows described under research administration.

    Where most DMPs start: the narrative plan

    A data management plan (DMP) is a document describing how data will be handled during and after a project: what data will be produced, how they will be stored and backed up, how they will be documented and shared, who is responsible, and how long they will be kept. In its common form it is narrative prose, often written against a funder or institutional template, answering a set of standard headings.

    A narrative answer typically reads something like this:

    “The project will generate approximately 200 GB of microscopy image data and associated tabular measurements. Active data will be stored on the institutional research-data store with nightly backup. On publication, processed datasets will be deposited in a generalist repository under a CC-BY licence and assigned a DOI. Raw image data containing no personal information will be retained for ten years in line with institutional policy. The principal investigator is responsible for data management.”

    There is nothing wrong with this. It is clear, honest, and answers the questions. Its limitation is purely that it is prose: a human must read it to extract any single fact, and no system can check it, update it, or connect it to anything else. Each fact — the licence, the repository, the retention period, the responsible person — is locked inside a sentence.

    The next step: structure the same content

    A machine-actionable DMP (maDMP) contains the same information, but expressed as structured, identified data rather than free text. The reference model is the RDA DMP Common Standard — a JSON schema developed by the Research Data Alliance to represent DMP content in a consistent, exchangeable form. Rather than a paragraph, each element becomes a typed field: a dataset has a title, a type, a personal-data flag, a planned size, a distribution with a named host and a licence, and a link to the responsible contributor, who is in turn identified by an ORCID iD.

    The narrative paragraph above, restructured against that model, becomes a set of explicit elements:

    • Dataset: “Microscopy image data” — type: image; personal data: no; estimated volume: 200 GB.
    • Distribution: host: named generalist repository; access: open; licence: CC-BY; identifier: DOI (assigned on deposit).
    • Retention: 10 years, per institutional policy.
    • Contributor: the principal investigator, identified by ORCID iD, with the role of data contact.
    • The whole plan itself carries a DMP ID — a persistent identifier, typically a DataCite DOI — so it can be cited and referenced across systems.

    The content is unchanged. What changes is that every fact is now addressable on its own.
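
    As a rough illustration, the elements above could be serialised along the following lines. The field names are intended to follow the RDA DMP Common Standard but should be checked against the published schema before use. The identifiers are placeholders, the CC BY version is an assumption, retention is carried in a free-text preservation statement, and the record omits fields a complete plan would need.

```python
import json

# Sketch only: not a validated instance of the standard.
# All names and identifiers are placeholders, not real records.
madmp = {
    "dmp": {
        "title": "Example microscopy project DMP",
        "dmp_id": {"identifier": "https://doi.org/PLACEHOLDER", "type": "doi"},
        "contributor": [
            {
                "name": "Principal Investigator (placeholder)",
                "contributor_id": {
                    "identifier": "https://orcid.org/PLACEHOLDER",
                    "type": "orcid",
                },
                "role": ["data contact"],
            }
        ],
        "dataset": [
            {
                "title": "Microscopy image data",
                "type": "image",
                "personal_data": "no",
                "preservation_statement": "Retained for 10 years per institutional policy.",
                "distribution": [
                    {
                        "title": "Processed microscopy datasets",
                        "data_access": "open",
                        "byte_size": 200 * 10**9,  # approx. 200 GB
                        "host": {"title": "Generalist repository (to be named)"},
                        "license": [
                            # CC BY version 4.0 is assumed; the narrative says only "CC-BY".
                            {"license_ref": "https://creativecommons.org/licenses/by/4.0/"}
                        ],
                    }
                ],
            }
        ],
    }
}

print(json.dumps(madmp, indent=2))
```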

    What the structure makes possible

    Structuring the plan is not bureaucracy for its own sake; it unlocks behaviours that a narrative simply cannot support.

    • Validation. A system can check that every planned dataset names a repository, that every distribution has a licence, and that the plan meets the funder’s required elements — before submission, automatically.
    • Exchange between systems. A plan authored in a DMP tool can be passed to the institution’s research-information system, to the repository at deposit time, and back to the funder, without anyone re-keying it. This is the maDMP exchange the standard was built for.
    • The living DMP. Because each element is addressable, the plan can be updated as the project unfolds — an anticipated dataset becomes a realised one when it is deposited, and the deposit’s DOI flows back into the plan. The DMP stops being frozen at award and becomes a current record of what actually happened to the data.
    • Connection to the wider record. Because the plan, its datasets, its contributors, and its host all carry identifiers, the DMP becomes a node in the identifier graph — linkable to the project (via a RAiD), to the people (via ORCID), to the institution (via ROR), and to the outputs (via their DOIs).
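
    The validation behaviour in the first bullet can be sketched in a few lines. The checks below (every dataset has a distribution, every distribution names a host and a licence) are a hypothetical rule set, not a funder’s actual requirements, and the dictionary layout assumes field names modelled on the RDA DMP Common Standard.

```python
def validate_plan(dmp):
    """Return a list of human-readable problems found in a structured plan."""
    problems = []
    for dataset in dmp.get("dataset", []):
        title = dataset.get("title", "<untitled>")
        distributions = dataset.get("distribution", [])
        if not distributions:
            problems.append(f"{title}: no distribution planned")
        for dist in distributions:
            if not dist.get("host", {}).get("title"):
                problems.append(f"{title}: distribution names no repository")
            if not dist.get("license"):
                problems.append(f"{title}: distribution has no licence")
    return problems

# Hypothetical plan: the first dataset is complete, the second is not.
plan = {
    "dataset": [
        {
            "title": "Microscopy image data",
            "distribution": [
                {
                    "host": {"title": "Generalist repository"},
                    "license": [{"license_ref": "https://creativecommons.org/licenses/by/4.0/"}],
                }
            ],
        },
        {
            "title": "Tabular measurements",
            "distribution": [{"host": {}}],
        },
    ]
}

for problem in validate_plan(plan):
    print(problem)
```

    A check like this can only run because each fact is a separate field; the same rules applied to a narrative paragraph would require a person to read it.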

    A realistic view of where this stands

    It is worth being candid: machine-actionable DMPs are an active and maturing area, not a universally deployed reality. The RDA Common Standard exists and is implemented in several DMP tools; DMP IDs are being minted; and funders are beginning to express interest in structured plans. But many researchers still write, and many funders still accept, narrative plans, and the end-to-end exchange between tools, repositories, and funders is still being built out. The practical takeaway is not that you must produce a maDMP tomorrow, but that writing your narrative plan with structure in mind — naming repositories, stating licences explicitly, identifying people by ORCID, treating the plan as living — positions the same content to become machine-actionable as the infrastructure matures.

    The cheapest move toward a machine-actionable plan is to stop writing the DMP as an essay and start writing it as a set of clear, specific commitments — named repository, explicit licence, identified people, stated retention. Structured thinking comes before structured data.

    Where shared vocabulary fits

    “Dataset”, “distribution”, “retention period”, “data contact”, and “living DMP” need to mean the same thing in a DMP tool, a repository, a CRIS, and a funder’s system for any of this exchange to work. A shared, federated vocabulary that defines these elements precisely — pointing back to the RDA DMP Common Standard for the schema — is what lets a plan authored in one system be acted on by another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the machine-actionable DMPs domain.

  • Ethics review and the IRB/REC process: what researchers should expect

    For research that involves people — their bodies, their behaviour, their data, their tissue — ethics review is not a bureaucratic hoop to clear before the real work begins. It is a substantive safeguard, the mechanism by which a community of researchers commits, in advance, that the people they study will be respected, protected and treated fairly. Researchers who approach it as a formality tend to find it frustrating; those who understand what it is trying to achieve usually find it navigable. This article explains what an ethics committee does, the review tiers a researcher will encounter, and the principles that underpin the whole system, drawing on the framework set out in the compliance and regulatory domain of the CASRAI Dictionary.

    What the committee is called, and what it does

    The body that conducts this review goes by different names in different places. In the United States it is the Institutional Review Board (IRB); in the United Kingdom and much of Europe it is the Research Ethics Committee (REC); in Australia it is the Human Research Ethics Committee (HREC). The names differ but the function is the same: an independent group, including both expert and lay members, that reviews proposed research involving human participants to ensure it is ethically acceptable before it proceeds.

    What the committee weighs is consistent across these systems. It assesses whether the risks to participants are reasonable in relation to the anticipated benefits; whether participants will give genuinely informed and voluntary consent; whether the selection of participants is fair; whether privacy and confidentiality are adequately protected; and whether any vulnerable groups involved have additional safeguards. The committee’s independence matters because it is precisely the people closest to a project — its own investigators — who are least able to judge its risks dispassionately.

    The tiers of review

    One of the most useful things a researcher can understand early is that review is not one-size-fits-all. Most systems operate graded tiers of review scaled to the risk a study poses, and knowing which tier applies sets realistic expectations for time and scrutiny.

    • Exempt review is for certain categories of low-risk research — for example some research using anonymised existing data, or certain educational and survey studies — that meet defined criteria. ‘Exempt’ does not mean no review at all; it usually means the committee, not the investigator, confirms that the exemption applies.
    • Expedited review is for research that poses no more than minimal risk and falls within specified categories. It is conducted by one or a few experienced reviewers rather than the full committee, which makes it quicker without lowering the standard for the questions asked.
    • Full board review is for research that involves more than minimal risk, vulnerable populations, or sensitive interventions. The whole convened committee considers it, and this is the most thorough — and necessarily the slowest — route.

    The single most common cause of frustration is a mismatch of expectation: submitting a higher-risk protocol and expecting an expedited timeline. Identifying the likely tier at the planning stage, and building the corresponding time into the project, prevents most of that friction.

    The Declaration of Helsinki and its lineage

    None of this arose in a vacuum. The modern ethics-review system rests on a series of foundational documents written in response to historical abuses. The Declaration of Helsinki, developed by the World Medical Association, is the central statement of ethical principles for medical research involving human subjects, and it is periodically revised to keep pace with new challenges. It articulates duties that have become the bedrock of review: the wellbeing of the individual participant takes precedence over the interests of science and society; participation must be voluntary and informed; risks must be minimised and justified; and research must be conducted by suitably qualified people under proper protocols.

    Alongside Helsinki sit other touchstones — in the United States, the principles articulated in the Belmont Report (respect for persons, beneficence and justice) and the federal Common Rule that operationalises them. A researcher does not need to memorise these documents, but understanding that the committee’s questions descend from them helps make sense of why it asks what it asks.

    Informed consent, done properly

    If one element sits at the centre of review, it is informed consent. Consent is not a signature on a form; it is a process by which a potential participant comes to understand what the research involves, what risks and benefits it carries, that participation is voluntary, and that they may withdraw without penalty. Committees scrutinise consent materials closely — for readability, completeness and honesty — and pay particular attention where consent is complicated: research with children, with adults who lack capacity, in emergency settings, or across cultural and language differences. The recurring expectation is that the participant genuinely understands and genuinely chooses, not merely that a box has been ticked.

    Working with the process, not against it

    Researchers get the most out of ethics review by treating the committee as a collaborator in protecting participants rather than as an obstacle. That means engaging early, before a protocol is locked; writing the application for an intelligent non-specialist, since lay members are part of the point; being candid about risks rather than minimising them, because a committee trusts an application that confronts its own weaknesses; and remembering that review continues after approval, through reporting of adverse events, amendments and, often, continuing review. Recording ethics approvals and their status as structured compliance metadata — alongside other obligations and the recognition of contributors through the CRediT taxonomy — helps keep this information visible across the research record rather than buried in a filing cabinet. The consistent vocabulary for describing ethics review, approval status and the wider compliance landscape is maintained in the CASRAI Dictionary.

  • Resolving author-order disputes: prevention and the COPE approach

    Few conflicts in research are as common, or as bitter, as a dispute over who appears where on the author line. The stakes are real: in many fields position carries career-defining information, and a demotion from first to second author can shape a hiring or tenure decision. These disputes are also largely preventable, and where they are not, there is a well-established process for handling them fairly. This article covers both, drawing on the practical guidance at resolving authorship disputes and the conventions of author order.

    Why author order carries so much weight

    To prevent disputes you have to understand what is being fought over. In many disciplines, position on the author line is not decorative; it is information. By widespread convention, the first author is the person who contributed most — typically the researcher who did the bulk of the work and wrote the draft. The last author is, in many fields, the senior position: the principal investigator or laboratory head who supervised the work. The corresponding author takes responsibility for the manuscript through review and after publication and is the point of contact for the record. Other conventions exist — alphabetical ordering is standard in mathematics, economics, and parts of the humanities, where order carries no contribution signal at all.

    The trouble is that these conventions are field-specific, tacit, and sometimes contradictory. A collaboration spanning disciplines may contain people who each “know” a different rule. When the rule is unstated, the gap fills with assumption, and assumption is where disputes are born.

    Prevention: the single most effective measure

    The overwhelming majority of author-order disputes can be avoided by one practice: agreeing authorship and order early, explicitly, and in writing — and revisiting the agreement as the work evolves. An early conversation forces the tacit conventions into the open, surfaces disagreement while it is still small, and creates a record to refer back to. The conversation should cover who will be an author at all (under the field’s authorship criteria), the basis for ordering, who will be corresponding author, and how the agreement will be revised if contributions shift. Projects change; an authorship agreement made at the outset should be treated as living, not fixed.

    Almost every intractable author-order dispute traces back to a conversation that never happened. The five minutes of awkwardness in agreeing order at the start of a project is the cheapest insurance in research.

    How CRediT helps prevent and de-escalate

    A contribution statement does not, by itself, decide order — and it is important to be clear that CRediT does not encode author order. It records what each person did, not where they sit on the line. But that very transparency is a powerful preventive tool. When a team fills in a CRediT statement together, mapping each person’s work to the fourteen roles, the relative contributions become explicit and discussable on the basis of fact rather than feeling. A disagreement about order can then be grounded in “who did what”, which is far easier to resolve than a clash of unspoken expectations. CRediT will not tell you who should be first author; it will give you the shared, honest picture of contribution from which a fair ordering conversation can proceed.
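
    As a rough sketch of what "filling in a CRediT statement together" can produce, the Python below records each person's roles and lists who holds each role. The contributor names and the `people_by_role` helper are hypothetical and chosen only for illustration; the role names follow the CRediT taxonomy. Names under each role are sorted alphabetically so that the output carries no ordering signal, in keeping with the point that CRediT records contribution, not position.

```python
# The fourteen CRediT roles.
CREDIT_ROLES = (
    "Conceptualization",
    "Data curation",
    "Formal analysis",
    "Funding acquisition",
    "Investigation",
    "Methodology",
    "Project administration",
    "Resources",
    "Software",
    "Supervision",
    "Validation",
    "Visualization",
    "Writing – original draft",
    "Writing – review & editing",
)


def people_by_role(roles_by_person):
    """Invert a person -> roles mapping into role -> sorted list of people.

    Raises ValueError if anyone claims a role outside the taxonomy.
    Roles that nobody holds are left out of the result.
    """
    result = {role: [] for role in CREDIT_ROLES}
    for person, roles in roles_by_person.items():
        for role in roles:
            if role not in result:
                raise ValueError(f"{person}: unknown CRediT role {role!r}")
            result[role].append(person)
    # Alphabetical sorting means the statement implies no author order.
    return {role: sorted(people) for role, people in result.items() if people}


# Hypothetical team, for illustration only.
statement = people_by_role({
    "R. Okafor": ["Investigation", "Software", "Writing – original draft"],
    "A. Lindqvist": ["Conceptualization", "Supervision", "Writing – review & editing"],
    "M. Tanaka": ["Investigation", "Formal analysis"],
})

for role, people in statement.items():
    print(f"{role}: {', '.join(people)}")
```

    A table like this gives the team a factual basis for the ordering conversation; it does not settle the order itself.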

    When prevention fails: the COPE approach

    Sometimes a dispute arrives anyway — a co-author objects to the order at submission, or a contributor demands to be added or removed, or a conflict erupts after acceptance. Editors are not left to improvise. The Committee on Publication Ethics (COPE) publishes flowcharts and guidance for exactly these situations, including changes to authorship after submission and disputes over who should be listed. The COPE approach has a consistent shape worth understanding:

    • The journal does not adjudicate the merits. Editors are not equipped, and have no standing, to decide who really contributed most. Their role is to ensure a fair, documented process, not to rule on the underlying contribution claim.
    • All listed and proposed authors must agree to any change. An author cannot be added, removed, or reordered without the documented agreement of all parties concerned.
    • The dispute is referred to the institution. Where authors cannot agree, COPE directs editors to ask the authors’ institution(s) to investigate, because the institution — not the journal — has the authority and the facts to resolve a contribution dispute.
    • The manuscript is paused, not pushed through. Publication is typically held until the dispute is resolved, so that the journal does not put its name to a contested authorship record.
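
    The unanimity rule in the second bullet can be expressed as a simple set check. The sketch below is an illustration only, not part of any COPE tooling: the function names and the example author names are hypothetical, and it assumes the editorial office already holds a record of who has given documented agreement.

```python
def outstanding_agreements(current_authors, proposed_authors, agreed):
    """Return, sorted, everyone whose documented agreement is still missing.

    The parties concerned are all currently listed authors plus all
    authors on the proposed list (so anyone added or removed is included).
    """
    concerned = set(current_authors) | set(proposed_authors)
    return sorted(concerned - set(agreed))


def change_may_proceed(current_authors, proposed_authors, agreed):
    """A change proceeds only when nobody's agreement is outstanding."""
    return not outstanding_agreements(current_authors, proposed_authors, agreed)


# Hypothetical request: add "Dee" and move "Ben" to first position.
current = ["Ada", "Ben", "Chen"]
proposed = ["Ben", "Ada", "Chen", "Dee"]
agreed = {"Ada", "Ben", "Dee"}

print(outstanding_agreements(current, proposed, agreed))
print(change_may_proceed(current, proposed, agreed))
```

    Here the change is held because one listed author has not yet agreed. If agreement cannot be reached, the check has nothing further to say: the dispute goes to the institution, as described above.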

    This division of labour is deliberate. The journal protects the integrity of the record by refusing to publish a disputed author list; the institution, which employs the people and holds the project records, does the fact-finding. Following the COPE flowchart gives editors a defensible, even-handed process and protects everyone involved from arbitrary decisions.

    A note on changing authorship after submission

    Requests to add or remove an author after submission are a frequent flashpoint and deserve particular care. A legitimate request (a contributor was genuinely overlooked, or a listed person turns out not to meet the criteria) should be handled transparently, with a clear written explanation and the agreement of all authors. A request that looks like a late attempt to add a guest author, or to remove someone because of a personal conflict, is exactly the situation the COPE guidance is built to slow down and document. The bright line is the same one that governs authorship generally: the list must reflect genuine contribution and accountability, not convenience or pressure.

    Where shared vocabulary fits

    “First author”, “corresponding author”, “senior author”, and the meaning of order itself vary by discipline, and that variation is a frequent source of cross-field confusion. A shared, federated vocabulary that defines these roles and conventions precisely — pointing back to COPE for dispute handling and to ICMJE for the authorship criteria — is what lets collaborators from different fields negotiate on common terms. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the research-integrity domain.

  • Licensing research data: CC BY, CC0 and when to use each

    You can deposit a dataset in a trusted repository, describe it with rich metadata, and give it a DOI — and still leave it effectively unusable, because you forgot the one line that tells a reuser what they are allowed to do with it. A dataset without a clear licence is data nobody can confidently build on: a careful researcher, unsure of the terms, will simply not reuse it. Licensing is therefore not a legal afterthought but the part of the data-infrastructure domain that determines whether a deposit delivers the “R” in FAIR at all. This guide explains the main choices — principally CC0 and CC BY — and when each fits.

    Why a licence is the reusability switch

    The FAIR principles ask that data be Findable, Accessible, Interoperable, and Reusable — and reusability rests explicitly on data being “released with a clear and accessible data usage licence”. Without a licence, default copyright and database rights leave the legal status ambiguous, and ambiguity is fatal to reuse: a would-be user cannot tell whether combining your data with theirs, redistributing it, or building a tool on it is permitted. An explicit, standard, machine-readable licence resolves that uncertainty in advance, for everyone, without anyone having to ask. That is why “attach an explicit licence” is the step that turns a findable dataset into a reusable one.

    The two main choices for data

    CC0 — the public-domain dedication

    CC0 is a Creative Commons tool by which the rights-holder waives, to the fullest extent the law allows, all copyright and related rights in the work — placing it as close to the public domain as possible. For data, CC0 means a reuser can use, combine, modify, and redistribute the data with no conditions at all, including no obligation to attribute. This is widely recommended as the default for research data, and for a specific reason: data are routinely aggregated from many sources, and attribution requirements that stack up across hundreds of datasets (“attribution stacking”) can become legally and practically unworkable. CC0 removes that friction entirely and maximises interoperability. Several major data repositories and infrastructures apply CC0 by default for exactly this reason.

    Importantly, CC0 waives legal requirements, not scholarly norms. Citing the data you use remains an academic and ethical expectation regardless of the licence — CC0 simply means that expectation is enforced by the norms of good scholarship rather than by copyright law.

    CC BY — attribution required

    CC BY permits the same broad reuse — use, adaptation, redistribution, including commercially — but on the single condition that the original creator is credited. For data, CC BY is appropriate where attribution matters enough to be a legal condition, or where a funder or institution requires it. It is the most permissive of the conditional Creative Commons licences and is the default for many open-access publications. The trade-off relative to CC0 is precisely the attribution clause: it guarantees credit, but it reintroduces the attribution-stacking problem when many datasets are combined.

    Choosing between them

    • Prefer CC0 for data intended for the widest possible aggregation and reuse, especially where the data will be merged with many other sources. It maximises interoperability and removes legal friction; rely on citation norms for credit.
    • Choose CC BY where attribution must be a legal condition, where a funder or repository mandates it, or where the dataset is a discrete, citable product whose creators need enforceable credit.
    • Be cautious with more restrictive clauses. Non-commercial (NC) and No-Derivatives (ND) terms substantially limit reuse and can render data incompatible with other open data; they are generally discouraged for research data unless a specific ethical or legal constraint demands them.
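
    The three rules above can be sketched as a small decision helper. This is an illustration under stated assumptions, not an official selector: the function and parameter names are hypothetical, the identifiers are SPDX-style short names, and the specific versions (CC0 1.0, CC BY 4.0) are assumed here because the guidance above does not name a version.

```python
def suggest_data_licence(attribution_legally_required=False, mandated=None):
    """Suggest an SPDX-style licence identifier for a dataset.

    mandated: an identifier imposed by a funder or repository, or None.
    A mandate wins; otherwise CC BY if attribution must be a legal
    condition; otherwise CC0 as the default for data.
    """
    if mandated is not None:
        return mandated
    if attribution_legally_required:
        return "CC-BY-4.0"  # assumed version
    return "CC0-1.0"  # assumed version


def restrictive_clauses(licence_id):
    """List any NC or ND clauses present in a Creative Commons identifier."""
    parts = licence_id.upper().split("-")
    return [clause for clause in ("NC", "ND") if clause in parts]


print(suggest_data_licence())
print(suggest_data_licence(attribution_legally_required=True))
print(restrictive_clauses("CC-BY-NC-ND-4.0"))
```

    A non-empty result from `restrictive_clauses` is a prompt to check whether a specific ethical or legal constraint really demands the restriction.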

    Data are not software: a critical caveat

    Creative Commons licences are designed for content (text, images, and data), and Creative Commons itself advises against using them for software. Software has needs that CC licences do not address: patent grants, the distinction between source and compiled code, and copyleft mechanics. For code, use a recognised software licence instead: a permissive one such as MIT, BSD, or Apache 2.0, or a copyleft one such as the GPL. If your deposit bundles a dataset and the code that processes it, license each part appropriately: a CC licence (or CC0) for the data, an OSI-approved software licence for the code. Conflating the two is one of the most common licensing mistakes in research deposits.

    A practical checklist

    1. Confirm you have the right to license the data. Check funder terms, any data-sharing agreements, third-party data within your dataset, and, for personal or sensitive data, consent and governance constraints. A licence cannot grant rights you do not hold.
    2. Default to CC0 for data unless there is a positive reason to require attribution; choose CC BY where there is.
    3. License software separately with an OSI-approved licence; never put code under a Creative Commons licence.
    4. State the licence explicitly in the deposit metadata and in any data availability statement, using the standard licence identifier so it is machine-readable.
    5. Cite the data you reuse regardless of its licence — the scholarly norm holds even when the law does not require it.
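
    Items 3 and 4 of the checklist lend themselves to a short sketch: a metadata fragment that states the data and code licences separately, using standard identifiers. The field names (`data`, `code`, `licence`) and the function name are hypothetical and do not follow any particular repository's schema; the identifiers are SPDX-style short names, and the prefix test for Creative Commons identifiers is a deliberately crude assumption.

```python
import json


def build_licence_metadata(data_licence, code_licence=None):
    """Return a metadata fragment with separate licences for data and code.

    Raises ValueError if the code licence looks like a Creative Commons
    identifier, following item 3 of the checklist (crude prefix check).
    """
    if code_licence is not None and code_licence.upper().startswith(("CC-", "CC0")):
        raise ValueError(
            f"{code_licence!r} is a Creative Commons identifier; "
            "use a software licence for code"
        )
    record = {"data": {"licence": data_licence}}
    if code_licence is not None:
        record["code"] = {"licence": code_licence}
    return record


# Hypothetical deposit bundling a dataset with its processing code.
metadata = build_licence_metadata("CC0-1.0", code_licence="MIT")
print(json.dumps(metadata, indent=2))
```

    Keeping the two licences in separate fields makes the distinction visible to both human readers and harvesters; the same identifiers can then be repeated in the data availability statement.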

    How this connects to contribution and credit

    Licensing answers “what may be done with this output?”; it is a sibling of the question “who made it?”, which the CRediT taxonomy answers. A dataset’s intellectual work is recorded on the associated paper through roles such as Data curation and Investigation, while the licence governs downstream reuse of the artefact itself. Used together — a clear licence on the data and clear contribution roles on the people — they ensure both the dataset and its creators are properly accounted for.

    Where shared vocabulary fits

    “CC0”, “CC BY”, “public domain”, “attribution”, and “reuse” are interpreted differently across repositories and funders, which undermines the very interoperability that licensing is meant to enable. A shared, federated vocabulary that defines these terms precisely — pointing back to Creative Commons for the licences and to the FAIR principles for the reusability requirement — is what lets a licence chosen for one repository be understood correctly in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain.

  • Ghost, guest and honorary authorship: what they are and how to avoid them

    Two opposite failures corrupt the authorship record, and they are mirror images of each other. In one, a name appears on a paper that should not be there; in the other, a person who did substantial work is left off entirely. Both distort who is accountable for the published work, and both are forms of authorship misconduct that journals and integrity bodies treat seriously. This article explains what they are and how to avoid them, building on the account of authorship and accountability and the formal authorship criteria.

    The starting point: authorship is accountability

    You cannot define the abuses without first fixing what authorship is supposed to be. The dominant standard in biomedical and much of STEM publishing is the ICMJE recommendation, which sets four criteria, all of which an author should meet: substantial contribution to the conception or design of the work, or to the acquisition, analysis, or interpretation of data; drafting the work or revising it critically for important intellectual content; final approval of the version to be published; and agreement to be accountable for all aspects of the work. The decisive idea running through all four is accountability. An author is someone who can answer for the work, not merely someone connected to it. Every form of authorship abuse is, at bottom, a breaking of that link between credit and accountability.

    Guest and honorary authorship: names that should not be there

    Guest authorship, also called honorary or gift authorship, is the inclusion of a person as an author when they have not made a contribution meeting the authorship criteria. The motives are familiar:

    • Adding a senior figure — a department head or laboratory director — whose name lends prestige but who did not contribute substantively to the specific work.
    • Reciprocal arrangements, where colleagues add each other to papers to inflate both publication lists.
    • Coercion, where a person in authority pressures a junior researcher to include them.

    Whatever the motive, the effect is the same: a name on the author line carries an implicit claim of contribution and accountability that is false. It dilutes the credit owed to those who did the work, and it attaches accountability to someone who cannot genuinely answer for the research. Honorary authorship is not a harmless courtesy; it is a misrepresentation of the contribution record.

    Ghost authorship: the writers who vanish

    Ghost authorship is the opposite failure: someone who made a contribution that qualifies for authorship, or who did substantial work on the manuscript, is not named as an author and frequently not acknowledged at all. The classic and most damaging case is the professional medical writer, often funded by a commercial sponsor, who drafts a paper that is then published under the names of academic authors with no disclosure of the writer’s role. Ghost authorship is especially corrosive because it conceals influence: a reader cannot weigh a possible conflict of interest they cannot see. It hides who actually shaped the words and, sometimes, who paid for them.

    There is a subtler, everyday version too. Postdocs, graduate students, and technicians who did substantial Investigation or Software work are sometimes pushed below the authorship line and into a footnote, or omitted entirely. Each instance erodes the integrity of the record by severing the contribution from the contributor.

    How the ICMJE criteria prevent both

    The elegance of a clear authorship standard is that the same test catches both abuses. Apply the four criteria honestly and the guest author fails them — they made no substantial contribution and cannot be accountable — so they should not be on the author line. Apply them honestly and the ghost is revealed — the medical writer who drafted the paper plainly meets the contribution and drafting criteria, so they must be named or, where they decline authorship, their role must be disclosed. The criteria are a bright line that, used in good faith, makes both the unearned name and the missing one visible.

    A useful discipline: for every name on the author line, ask whether that person can answer for the work. For everyone who did substantial work, ask whether they appear. The first question catches guests; the second catches ghosts.
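
    The two questions translate directly into two set differences. The sketch below assumes a team has already recorded, honestly, who can answer for the work and who did substantial work; the function name, parameter names, and example people are hypothetical.

```python
def audit_author_list(author_line, can_answer_for_work, did_substantial_work):
    """Return (possible_guests, possible_ghosts) as sorted lists.

    possible_guests: named authors who cannot answer for the work.
    possible_ghosts: people who did substantial work but are not named.
    """
    authors = set(author_line)
    possible_guests = sorted(authors - set(can_answer_for_work))
    possible_ghosts = sorted(set(did_substantial_work) - authors)
    return possible_guests, possible_ghosts


# Hypothetical paper, for illustration only.
guests, ghosts = audit_author_list(
    author_line=["Postdoc A", "PI B", "Department Head C"],
    can_answer_for_work={"Postdoc A", "PI B"},
    did_substantial_work={"Postdoc A", "PI B", "Medical Writer D"},
)

print("Possible guests:", guests)
print("Possible ghosts:", ghosts)
```

    The flags are prompts for a conversation rather than verdicts: a person flagged as a possible ghost may legitimately decline authorship, in which case their role should be disclosed instead.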

    How CRediT helps — and one trap to avoid

    The CRediT taxonomy strengthens the defence by making contribution explicit. When each author’s specific roles are recorded against the fourteen CRediT roles, a guest author has nowhere to hide: they must either claim a role they did not perform — a falsifiable and serious misstatement — or appear with no roles at all, which invites the obvious question. A transparent contribution statement makes honorary authorship costly to sustain.

    But there is a trap. Because most publishers apply CRediT only to named authors, the taxonomy can inadvertently encourage a mild form of ghosting: authors, unable to credit the technician or writer who did the work, attribute that work to themselves. The fix is to credit contributors properly — through acknowledgements where authorship is genuinely not warranted, and by extending structured contribution metadata to acknowledged contributors as the standard evolves — rather than absorbing their roles into an author’s line.

    What to do — for authors, supervisors and journals

    • Agree authorship early. Decide, in writing, who will be an author and on what basis at the start of a project, and revisit it as contributions change. Most disputes and abuses grow from silence.
    • Apply the criteria, not the hierarchy. Seniority is not a contribution. A director who did not contribute substantively should be acknowledged, not listed as an author.
    • Name the writers. Professional and medical writers must be disclosed; ghost-writing is incompatible with publication integrity.
    • Use contribution statements. A CRediT statement confirmed by every named author makes both guests and ghosts harder to sustain.
    • Follow COPE guidance when problems surface. The Committee on Publication Ethics provides flowcharts for editors handling suspected guest or ghost authorship; they set out a fair, documented process.

    Where shared vocabulary fits

    Terms like “guest”, “gift”, “honorary”, and “ghost” authorship are used loosely and sometimes interchangeably, which weakens policy that depends on them. A shared, federated vocabulary that defines these precisely — pointing back to ICMJE for the criteria and COPE for the handling of misconduct — is what lets editors and institutions act on a common understanding. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the research-integrity domain.
