Tag: responsible metrics

  • Responsible metrics: the Leiden Manifesto and the Metric Tide in practice

    Metrics are seductive because they are simple. A single number — a journal’s impact factor, a researcher’s h-index, a citation count — promises to compress the messy, qualitative business of judging research into something fast, comparable and apparently objective. And metrics are dangerous for exactly the same reason: their simplicity hides what they leave out, and their apparent objectivity lends unearned authority to comparisons they cannot really support. The response to this tension has not been to abolish metrics but to use them responsibly — to let quantitative indicators inform expert judgement rather than replace it. Two landmark statements from 2015, the Leiden Manifesto and The Metric Tide, set out what responsible use looks like. This article examines both and how they translate into practice, drawing on the responsible assessment domain of the CASRAI Dictionary.

    The Leiden Manifesto

    The Leiden Manifesto for research metrics, published in 2015, offers ten principles for the responsible use of quantitative indicators. Several of its themes recur throughout the responsible-metrics movement and are worth drawing out. It insists that quantitative evaluation should support, not supplant, qualitative expert assessment: metrics inform judgement; they do not make it. It warns against measuring performance against inappropriate or generic benchmarks, urging that assessment account for the mission and context of the research. It calls for transparency in the data and methods behind any indicator, so that those being assessed can understand and scrutinise how they are judged. It highlights the importance of accounting for variation between fields, since citation behaviour differs enormously across disciplines and naive comparison across them is meaningless. And it cautions against the distortions metrics produce when they become targets, the problem often called Goodhart’s law: an indicator, once it is what people are rewarded for, stops measuring what it was meant to.
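
    To make the point about field variation concrete, here is a minimal Python sketch of one simple way to put citation counts on a comparable footing: dividing each paper’s count by the mean for its field. The records, field labels and the function name field_normalised are invented for illustration, and the method is a deliberate simplification; real normalised indicators typically also control for publication year and document type.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (paper id, field, citation count). Illustrative only.
papers = [
    ("p1", "cell biology", 40),
    ("p2", "cell biology", 20),
    ("p3", "history", 4),
    ("p4", "history", 2),
]


def field_normalised(records):
    """Divide each paper's citations by the mean citation count of its field."""
    by_field = defaultdict(list)
    for _, field, cites in records:
        by_field[field].append(cites)

    # Field baseline: the average citation count of the papers in that field.
    baseline = {field: mean(counts) for field, counts in by_field.items()}

    # A score of 1.0 means "cited as often as the field average".
    # Fields whose average is zero get None rather than a division error.
    return {
        pid: (cites / baseline[field] if baseline[field] else None)
        for pid, field, cites in records
    }


scores = field_normalised(papers)
for pid, score in scores.items():
    print(pid, round(score, 2))
```

    In this toy data the cell-biology paper with 40 citations and the history paper with 4 receive the same normalised score, because each sits the same distance above its own field’s average. Even a normalised figure remains an input to expert judgement, not a verdict.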

    The Metric Tide

    Published the same year, The Metric Tide was an independent review of the role of metrics in research assessment, conducted in the United Kingdom. Its central contribution was the concept of responsible metrics, defined through a set of dimensions that have become a common reference point:

    • Robustness — basing indicators on the best available, accurate data.
    • Humility — recognising that quantitative evaluation should support, not supplant, expert assessment.
    • Transparency — keeping data collection and analytical processes open to scrutiny.
    • Diversity — accounting for variation by field and using a range of indicators to reflect the plurality of research.
    • Reflexivity — recognising and anticipating the systemic effects of indicators and updating them in response.

    The review was notably sceptical of reducing assessment to single numbers and emphasised that metrics work best as a complement to peer review, not a substitute for it. Its framing of responsible metrics as a set of dimensions to be designed for, rather than a checklist to be passed, has proved durable.

    What the two have in common

    Read together, the Leiden Manifesto and The Metric Tide converge on a consistent message. Metrics are useful but partial; they must be transparent so they can be questioned; they must respect disciplinary difference; they must be used with humility alongside expert judgement; and their users must stay alert to the behaviour they induce, because any metric that becomes a target will eventually be gamed or will distort the work it was meant to measure. Neither document is anti-metric. Both are against the misuse of metrics — against the false precision of a single number standing in for a considered judgement about the quality and significance of research.

    From principle to practice

    Translating these principles into institutional practice means concrete commitments: assessing research on its own merits rather than on the prestige of its publication venue, using a basket of indicators rather than any single one, being transparent about what is measured and how, contextualising comparisons by field and career stage, and keeping expert peer judgement at the centre with metrics in a supporting role. These commitments connect directly to the broader assessment-reform movement. The principle of not judging research by where it is published is the heart of the comparison in our DORA versus CoARA overview, while the specific hazards of the two most over-used single numbers are examined in our look at the journal impact factor versus the h-index. Responsible metrics is the methodological backbone these reform initiatives share.

    Metrics and the recognition of contribution

    One reason single-number metrics mislead is that they obscure who actually did the work and what they did. A citation count attaches to a paper, not to the distinct contributions of the people who made it. Structured contributorship through the CRediT taxonomy — whose full set of roles is described in our overview of the CRediT roles — offers a more granular and honest picture of contribution than any aggregate metric can, and is a natural complement to responsible assessment: it supports judging people on what they genuinely contributed rather than on a number that flattens it. The consistent vocabulary that lets assessment frameworks, indicators and contribution records be described and exchanged the same way across systems is maintained in the CASRAI Dictionary, helping ensure that responsible metrics rests on a shared and well-defined foundation.

  • Altmetrics and research impact: what attention data can and cannot show

    Altmetrics promise something seductive: a near-real-time count of the attention a research output is attracting across news, policy documents, social media, blogs, and reference managers, available within days of publication rather than the years a citation count takes to accumulate. That promise is real, and altmetrics genuinely capture forms of reach that citations miss. But the same speed and breadth that make them useful also make them easy to misread, and the gap between “attention” and “impact” is where most of the trouble lies. This article sets out what altmetrics can and cannot show. It builds on the broader treatment in the engagement, impact and SDG-alignment domain of the CASRAI Dictionary.

    What altmetrics actually measure

    Altmetrics — short for alternative metrics — are indicators of the online attention and engagement a research output receives, drawn from sources outside the traditional citation databases. Typical sources include mentions in news outlets and policy documents, posts and shares on social media, blog coverage, Wikipedia citations, and saves in reference managers such as Mendeley. They are usually aggregated against a specific output — identified by its DOI — and presented as a score or a breakdown by source.

    The honest one-line description is this: altmetrics count attention. They tell you that an output was mentioned, shared, saved, or referenced outside the formal citation record, and roughly where and how much. That is genuinely valuable information, and it is information that citation counts, by their nature, cannot provide.

    What they are useful for

    • Speed. Attention accrues within days, so altmetrics can surface early engagement long before citations could exist. For recent outputs they may be the only quantitative signal available.
    • Breadth beyond academia. A citation count measures uptake by other researchers. Altmetrics can show reach into policy, news media, and public discussion — audiences a citation count is structurally blind to. For an output whose value is partly its public or policy reach, this is exactly the dimension that matters.
    • Qualitative leads, not just numbers. The most useful part of an altmetric record is often not the score but the underlying mentions: which policy document cited the work, which outlet covered it, what the coverage said. Followed up, these point to specific instances of reach that can seed a genuine impact narrative.
    • A complement to citations. Used alongside citation data and qualitative evidence, altmetrics add a view that the other sources lack. Their role is supplementary, not substitutive.

    What they cannot show

    The central caution is simple and must be stated plainly: attention is not impact, and attention is not quality. A high altmetric score means an output was talked about; it says nothing, by itself, about whether the research is sound, whether the attention was positive, or whether any real-world change followed.

    • Attention can be negative. A paper widely shared because it is being criticised, debunked, or ridiculed can score highly. The count does not distinguish praise from condemnation.
    • Attention is not benefit. Genuine research impact — a changed policy, an improved treatment, an adopted practice — is a downstream outcome that an attention count cannot demonstrate. Altmetrics may flag where to look for impact; they are not evidence of it.
    • The numbers are gameable and biased. Social-media-derived metrics can be inflated by coordinated sharing, and they systematically favour topics, languages, and communities that are active online — which is not the same as the topics that matter most.
    • Scores are not comparable across contexts. A single composite altmetric number compresses very different kinds of attention into one figure, and that figure means different things in different fields and for different output types. Comparing scores across disciplines is largely meaningless.

    The responsible-metrics frame

    This is where the wider movement for responsible research assessment provides the discipline that keeps altmetrics honest. The Leiden Manifesto for research metrics (2015) set out principles for the responsible use of quantitative indicators that apply directly here. Three are especially relevant to altmetrics:

    • Quantitative evaluation should support, not supplant, expert qualitative judgement. Altmetrics are an input to a human assessment, never a replacement for reading the work and weighing its contribution.
    • Account for variation by field. Attention patterns differ enormously between disciplines and output types; a metric must be interpreted in context, not applied as a universal yardstick.
    • Avoid misplaced concreteness and false precision. A single score presented to a decimal point invites a confidence the underlying data do not support. The number is an indicator, not a measurement of worth.

    The same spirit runs through the broader reform agenda — the Declaration on Research Assessment (DORA) and the Coalition for Advancing Research Assessment (CoARA) — which presses evaluators away from reliance on any single quantitative proxy and toward judging the substance of contributions. Altmetrics sit comfortably inside that frame as one more contextual signal, and sit very badly outside it as a standalone score to be maximised.

    Treat an altmetric score the way you would treat a smoke alarm: useful for telling you where to look, useless as a measure of how big the fire is. The value is in the mentions it points you to, not in the number itself.

    Using altmetrics well

    1. Read the mentions, not just the score. The specific policy citation or news item is the evidence; the aggregate number is only a pointer.
    2. Pair them with citations and qualitative evidence. No single indicator carries an assessment; altmetrics are one strand among several.
    3. Interpret in context. Field, output type, and audience all change what a given level of attention means.
    4. Never use a score as a ranking or a target. Optimising for attention corrupts the signal and invites the gaming the metric is most vulnerable to.
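
    The advice above can be sketched in code. The following minimal Python example assumes a hypothetical list of mention records for a single output (the DOI, titles and tone labels are all invented, and no real altmetrics provider’s API is used) and shows the difference between collapsing attention into one count and keeping the breakdown and the readable leads.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    doi: str
    source: str  # e.g. "news", "policy", "social", "blog"
    title: str
    tone: str    # "positive", "negative" or "neutral", assigned by a human reader


# Hypothetical mentions of one output; every value here is made up.
mentions = [
    Mention("10.1234/example", "policy", "Hypothetical ministry white paper", "positive"),
    Mention("10.1234/example", "news", "Hypothetical newspaper debunking piece", "negative"),
    Mention("10.1234/example", "social", "Hypothetical post sharing the debunking", "negative"),
    Mention("10.1234/example", "social", "Hypothetical post by a co-author", "neutral"),
]


def breakdown(items):
    """Count mentions per source instead of collapsing them into one score."""
    return Counter(m.source for m in items)


def leads(items, sources=("policy", "news")):
    """Return the mentions worth reading in full: the evidence, not the count."""
    return [m for m in items if m.source in sources]


by_source = breakdown(mentions)
to_read = leads(mentions)

print("total mentions:", sum(by_source.values()))
print("by source:", dict(by_source))
for m in to_read:
    print(f"read: [{m.source}, {m.tone}] {m.title}")
```

    The total of four mentions says little; the breakdown shows that two of them come from policy and news sources worth reading, and the human-assigned tone shows that one of those is critical coverage.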

    Where shared vocabulary fits

    “Impact”, “attention”, “reach”, “engagement”, and “altmetric” are used loosely and often interchangeably, which is exactly how attention data gets mistaken for evidence of benefit. A shared, federated vocabulary that defines these terms precisely, distinguishing attention from impact and pointing back to the Leiden Manifesto and the responsible-assessment frameworks for the caveats, is what lets engagement data be used honestly in evaluation. Supplying that definitional layer is the role the CASRAI Dictionary is designed to play; the relevant terms sit in the engagement, impact and SDG-alignment domain.

  • DORA, CoARA and narrative CVs: assessing research responsibly

    For a decade, “responsible research assessment” was mostly a matter of declarations: statements of principle that institutions signed and then struggled to operationalise. That has changed. Assessment reform has moved from declaration to practice, and anyone who now evaluates research or researchers, whether on a hiring panel, a promotion committee, or a grant board, is increasingly expected to do so by methods that the reform movement has made concrete. This article sets out how the three load-bearing pieces (DORA, CoARA, and the narrative CV) fit together, and what they ask of an evaluator. It draws on the responsible-assessment domain of the CASRAI Dictionary.

    DORA: the declaration that named the problem

    The Declaration on Research Assessment (DORA), issued in 2013, was the movement’s opening move. Its central target was the misuse of the journal impact factor as a proxy for the quality of individual papers and individual researchers. DORA’s argument was straightforward: a journal-level metric says nothing reliable about any single article published in that journal, and using it to judge a researcher’s work — for hiring, promotion, or funding — is a category error. DORA asked institutions, funders, and publishers to stop doing it, and to assess research on its own merits.

    DORA’s contribution was to name the problem clearly and to gather signatories — thousands of them — behind the principle. What it deliberately did not do was prescribe a detailed alternative. It was a declaration of what to stop, more than a manual for what to start. That left a gap, which the next decade’s work set out to fill.

    CoARA: from principle to coalition commitment

    The Coalition for Advancing Research Assessment (CoARA), launched in 2022, is DORA’s successor in spirit, with an operational focus. Where DORA asked organisations to agree with a principle, CoARA asks members to commit to a reform agreement and to produce action plans for changing their own assessment practices. Its membership runs to hundreds of organisations (universities, funders, learned societies) across Europe and beyond.

    The shift from DORA to CoARA is the shift from “we endorse this” to “here is what we will change and by when.” CoARA’s commitments include recognising a diversity of research outputs and activities, basing assessment primarily on qualitative judgement supported by responsible use of metrics rather than the reverse, and abandoning inappropriate uses of journal- and publication-based metrics. It is, in effect, DORA’s principle turned into an implementation programme that members are accountable to.

    The narrative CV: the practical instrument

    If DORA named the problem and CoARA organised the commitment, the narrative CV is the instrument through which reform actually reaches an individual assessment. A narrative CV is a free-text format in which a researcher describes their contributions in prose, structured around a small set of prompts, rather than presenting an enumerated list of publications and metrics. The best-known implementation is UKRI’s Résumé for Research and Innovation (R4RI), which became standard across all UKRI funding from January 2024, building on the Royal Society’s earlier Résumé for Researchers. Wellcome, several other funders, and a number of institutions run their own variants.

    The narrative CV typically asks a researcher to describe their contributions across several dimensions — to the generation of knowledge, to the development of individuals, to the wider research community, and to broader society — rather than to list outputs by venue. The point is to make visible the contributions that a publication list renders invisible: mentorship, team building, peer review, open-science work, and the other forms of hidden labour that the Hidden REF initiative has campaigned to recognise. It is the mechanism by which a panel can assess a researcher as a contributor to research culture, not merely as a producer of papers.

    Responsible metrics, not no metrics

    A persistent misreading of this movement is that it is anti-metric. It is not. The principle, articulated in the Leiden Manifesto of 2015 and carried through CoARA, is responsible metrics: the principled use of quantitative indicators, always contextualised, always combined with qualitative expert judgement, never used as a substitute for reading the work. The objection is not to counting things; it is to letting a count — especially a journal-level one — stand in for judgement about an individual contribution. A responsible assessment may well use metrics; it simply refuses to let them do the assessing.

    How the three fit together

    The relationship is a progression from principle to practice. DORA supplies the foundational principle: do not mistake journal metrics for research quality. CoARA supplies the organised commitment and accountability: members agree to reform and publish how. The narrative CV supplies the concrete instrument: a format that forces an assessment to engage with what a researcher actually contributed. An evaluator working responsibly today is, in effect, applying DORA’s principle through CoARA-aligned practice using narrative-CV instruments.

    What responsible assessment asks of an evaluator

    Concretely, the movement asks an evaluator to read the work rather than its venue; to weigh a diversity of outputs — datasets, software, protocols, models — alongside articles, which presupposes a modern outputs taxonomy that recognises them; to use metrics only in support of judgement, never as a proxy for an individual’s worth; to recognise the hidden labour the narrative format is designed to surface; and to apply consistent qualitative criteria through a shared rubric, so that “narrative” does not become “unstructured and incomparable.”

    That last point is the live challenge. A narrative CV trades the false precision of metrics for the richer but less standardised evidence of prose, and prose is harder to compare across candidates. The answer is not to retreat to metrics but to develop shared rubrics so that narrative assessments are rigorous and fair rather than impressionistic.

    Where the dictionary fits

    Responsible assessment is awash with terms that every funder and institution defines slightly differently: narrative CV, contribution narrative, responsible metrics, hidden labour, team science. Without shared definitions, every reviewer reinvents their own rubric, which is exactly the inconsistency the movement is trying to escape. A shared, operational vocabulary for these concepts is what lets a narrative-CV reviewer at one institution mean the same thing as one at another. Providing that vocabulary, and pointing to DORA, CoARA, and UKRI for the normative content, is the convening role the CASRAI Dictionary is built for. For a side-by-side account of the two frameworks, see our DORA versus CoARA comparison.

    What to do now

    For evaluators: read the work, use metrics only responsibly and in support of judgement, and engage seriously with the contributions a narrative CV surfaces. For institutions and funders: align practice with CoARA commitments and adopt narrative-CV formats with shared, qualitative rubrics so that assessments are comparable and fair. For standards work: define the responsible-assessment vocabulary operationally, federating to DORA, CoARA, and the funder narrative-CV guidance.

  • The SCOPE framework for responsible research evaluation: a practical model for designing fair evaluations

    The movement to reform research assessment has produced a powerful set of principles. Declarations and manifestos have told the community what to stop doing: stop using journal-based metrics as a proxy for the quality of individual articles, stop reducing complex contributions to a single number, stop letting convenient indicators substitute for judgement. These principles are essential, but they leave a practical gap. An evaluator — a panel chair, a research manager, a committee designing a hiring process — who agrees with all of them still has to design and run an actual evaluation, and “don’t do the bad things” is not, by itself, a method. The SCOPE framework, developed within INORMS (the International Network of Research Management Societies), exists to fill that gap by offering a structured process for designing a responsible evaluation. This article explains it, drawing on the responsible assessment domain of the CASRAI Dictionary.

    From principles to process

    The distinctive contribution of SCOPE is that it is a how-to, not a what-not-to. Where DORA and the Leiden Manifesto articulate the values and warn against the failure modes of assessment, SCOPE provides a sequence of steps an evaluator can actually follow to build an evaluation that honours those values. It treats the design of an evaluation as a deliberate act requiring thought, rather than a default to be reached for unreflectively. The name SCOPE is an acronym for the stages of that process, and working through them in order is meant to prevent the most common error in assessment: choosing the measure first — usually whatever is easy to count — and only afterwards, if at all, asking whether it actually captures what matters.

    The five steps

    SCOPE guides an evaluator through five stages:

    • S — Start with what you value. Before anything is measured, articulate what the evaluation is genuinely meant to recognise and encourage. This puts values, not available data, in the driving seat, and forces clarity about the purpose of the exercise.
    • C — Context considerations. Take account of the specific context: who or what is being evaluated, the discipline, the career stage, the conditions, and the consequences the evaluation will have. An approach appropriate in one context may be unfair or meaningless in another.
    • O — Options for evaluating. Consider the range of possible ways to conduct the evaluation — qualitative and quantitative, expert judgement and indicators — rather than defaulting to the most familiar tool. This is where the evaluator deliberately weighs alternatives.
    • P — Probe deeply. Interrogate the chosen approach. What are its limitations, biases and unintended effects? Who might it disadvantage? What behaviour will it incentivise? Probing before committing is how harms are caught in advance.
    • E — Evaluate your evaluation. After the exercise, assess whether the evaluation actually worked — whether it served its purpose, was fair, and had the intended effects — and feed what is learned back into future practice.

    The order is the point. By beginning with values and context and treating measurement as a later, considered choice, SCOPE structurally resists the temptation to let convenient metrics define what counts.
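
    Because the order is the point, it can help to see it enforced. The following is a minimal Python sketch, not part of the SCOPE framework itself: the class EvaluationDesign and its record method are hypothetical names, and the sketch simply refuses to accept notes for a stage until every earlier stage has been recorded.

```python
from dataclasses import dataclass, field

# The five SCOPE stages, in the order the framework sets them out.
SCOPE_STAGES = (
    "Start with what you value",
    "Context considerations",
    "Options for evaluating",
    "Probe deeply",
    "Evaluate your evaluation",
)


@dataclass
class EvaluationDesign:
    """Hypothetical helper: a design record that must be filled in stage order."""
    notes: dict = field(default_factory=dict)

    def record(self, stage, note):
        if stage not in SCOPE_STAGES:
            raise ValueError(f"Unknown stage: {stage!r}")
        position = SCOPE_STAGES.index(stage)
        # Any earlier stage without a note blocks this one.
        missing = [s for s in SCOPE_STAGES[:position] if s not in self.notes]
        if missing:
            raise ValueError(f"Complete earlier stages first: {missing}")
        self.notes[stage] = note


design = EvaluationDesign()
design.record("Start with what you value", "Recognise mentoring and open data")
design.record("Context considerations", "Early-career applicants in the humanities")

# Jumping ahead to probing an approach before weighing the options is rejected.
try:
    design.record("Probe deeply", "Check the shortlist for bias")
    jumped_ahead = True
except ValueError as err:
    jumped_ahead = False
    print(err)
```

    The design choice is that values and context must be written down before any option, and therefore any metric, can be recorded at all.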

    How SCOPE relates to DORA, CoARA and the Leiden Manifesto

    SCOPE does not compete with the major assessment-reform initiatives; it operationalises them. The San Francisco Declaration on Research Assessment (DORA) sets out commitments to stop misusing journal-based metrics and to assess research on its own merits; SCOPE gives an evaluator a way to design assessments that actually deliver on those commitments. The Leiden Manifesto offers principles for the responsible use of metrics — supporting rather than supplanting expert judgement, accounting for context, recognising the limits of indicators — and SCOPE’s steps are, in effect, a procedure for honouring those principles in a concrete exercise. The Coalition for Advancing Research Assessment (CoARA) commits its many signatory organisations to reforming how they assess research; SCOPE is precisely the kind of practical tool such organisations need to translate their commitments into the design of real evaluations. In short, the declarations supply the why and the constraints; SCOPE supplies a disciplined way to do the work within them.

    Why a process matters

    It is worth dwelling on why a process, rather than a set of rules, is the right form for this. Research is too varied for a single prescribed method to fit every case: what is fair when assessing a senior researcher differs from what is fair for an early-career one; what makes sense in a laboratory discipline differs from a field where books and long-form scholarship dominate. A rigid rule (“always use X”) would simply replace one bad default with another. A process like SCOPE instead equips the evaluator to make a good, context-sensitive decision each time, while guarding against the predictable failure modes. It respects the irreducible role of judgement in assessment while ensuring that judgement is exercised thoughtfully and transparently rather than by reflex.

    Describing contribution for fairer assessment

    Responsible evaluation depends on having good information about what people have actually contributed, described in a way that does not collapse into crude proxies. This is where structured contribution information supports the goals of frameworks like SCOPE. The CRediT taxonomy — with its full set of contribution roles — lets an evaluation recognise the specific roles a person played rather than inferring contribution from authorship position or counting papers. Richer, structured information about contribution gives evaluators better material to exercise the considered judgement SCOPE is designed to support, and complements the narrative approaches increasingly used in responsible assessment. The institutional work of putting such practices in place is part of the broader remit of research administration.

    A consistent foundation for evaluation

    For responsible evaluation to work across institutions and systems, the information it draws on must be described consistently — contributions, outputs, roles and the rest. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the evidence feeding an evaluation means the same thing wherever it comes from. SCOPE reminds us that good assessment is something you design, not something you default into; a shared vocabulary helps ensure the materials you design with are sound.