Tag: IGSN

  • Identifiers for Things, Not Just Papers: IGSN and PIDINST

    When researchers think about persistent identifiers, they usually picture DOIs on papers and datasets or ORCID iDs on people. Yet a great deal of research turns on physical things: a sediment core drilled from a lake bed, a tissue specimen in a biobank, a water sample from a particular depth on a particular day, or the spectrometer that analysed it. These physical research objects have historically been referred to by inconsistent local labels, if they were referred to at all. Two complementary efforts, the IGSN for samples and the PIDINST work for instruments, set out to give them stable, global identifiers.

    Why physical objects need PIDs

    The case for identifying physical objects mirrors the case for identifying any research output. A persistent identifier lets a sample or instrument be referred to unambiguously across publications, datasets, and laboratories. It allows the measurements derived from a sample to be linked back to the sample itself, and onward to the instrument that produced them. Without such links, reuse and verification become difficult: a reader cannot easily tell whether two studies analysed the same specimen, or whether a calibration problem on a particular instrument might affect a body of results. Persistent identification turns scattered physical objects into nodes in a connected research graph, supporting the goals of FAIR data.

    IGSN: identifiers for samples

    The IGSN began in the geosciences as the International Geo Sample Number, a way to give individual physical samples a globally unique identifier so that specimens could be tracked and cited across the literature. As the approach proved useful beyond geology, the system evolved. Samples are now registered as IGSN IDs issued through DataCite, which brings sample identification into the same DOI-based infrastructure used for datasets and other outputs. This alignment means a sample can carry a resolvable identifier, a landing page, and structured metadata describing what the sample is, where and when it was collected, and how it relates to other objects.

    The practical effect is that a physical specimen becomes a citable entity. A paper can reference the exact sample it analysed; a dataset can link each measurement to the sample it came from; and a repository can expose the provenance of its holdings. For disciplines that depend on irreplaceable physical material, from earth science to the life sciences, this is a meaningful advance in traceability.

    PIDINST: identifiers for instruments

    Where IGSN addresses samples, the PIDINST working group, convened under the Research Data Alliance, addressed the instruments themselves. The group developed a metadata schema for persistent identification of measuring instruments, so that a microscope, sensor, telescope, or analytical device can be referenced by a persistent identifier and described in a consistent way. The schema captures the kind of information that makes an instrument identifiable and useful to cite: what it is, who owns or operates it, its model and configuration, and identifiers for related entities such as the institution that hosts it.

    Identifying instruments matters because the measuring apparatus is part of the methods. When the data from an experiment can be linked to the specific instrument that produced them, it becomes possible to assess instrument-related effects, to credit the facilities that maintain expensive equipment, and to trace a result from a published figure all the way back to the device on a laboratory bench.

    Connecting the chain of provenance

    The real power of these identifiers appears when they are used together. Imagine a measurement linked to the instrument that produced it via a persistent identifier described with PIDINST metadata, to the sample it was taken from via an IGSN ID, to the dataset it belongs to via a DataCite DOI, and to the researchers responsible via their ORCID iDs. Each link is a small piece of metadata, but together they describe an unbroken chain of provenance from a published claim back to the physical objects and people behind it. That is precisely the kind of connected, machine-actionable record that modern research infrastructure aspires to.
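
    As a rough illustration of that chain, the sketch below models one measurement with its four links and reports which of them are missing. The class, the field names, and the identifier values are all assumptions invented for this example; they do not come from any of the schemes mentioned.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class MeasurementProvenance:
    """Identifiers attached to one measurement (field names are illustrative)."""
    instrument_pid: Optional[str] = None    # PID of the instrument that produced it
    sample_igsn: Optional[str] = None       # IGSN ID of the physical sample
    dataset_doi: Optional[str] = None       # DOI of the dataset it belongs to
    researcher_orcid: Optional[str] = None  # ORCID iD of the responsible researcher


def missing_links(record: MeasurementProvenance) -> List[str]:
    """Return the names of absent links, i.e. where the chain is broken."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Placeholder values only; none of these is a registered identifier.
record = MeasurementProvenance(
    instrument_pid="10.5555/instrument-0001",
    sample_igsn="10.5555/sample-0001",
    dataset_doi="10.5555/dataset-0001",
)
print(missing_links(record))  # ['researcher_orcid']
```

    A check like this is only a completeness test: it shows that a link is recorded, not that the identifier behind it resolves.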

    Towards a fully identified research record

    Extending persistent identification to samples and instruments fills two of the larger gaps in the research record. Articles, data, organisations, and people increasingly carry stable identifiers; physical objects and the apparatus that measures them have lagged behind. By bringing samples into the DataCite ecosystem as IGSN IDs and by giving instruments a shared metadata schema through PIDINST, the community is steadily closing those gaps. The vocabularies and crosswalks that hold such a record together are the kind of standards work catalogued in the CASRAI Dictionary, and they complement contributor frameworks such as CRediT by anchoring the human contributions to the physical things they acted upon.

  • Identifying instruments and samples: PIDINST and IGSN

    Over the past two decades, research has built an impressive web of persistent identifiers. Articles have DOIs, datasets have DOIs, researchers have ORCID iDs, organisations have ROR identifiers, and grants and projects are increasingly identified too. Follow any one of these and you can traverse the others — this person wrote that paper, which used this dataset, funded by that grant. But there have long been two conspicuous gaps in this graph, both at the point where research meets the physical world: the instruments that generate measurements, and the physical samples from which data are drawn. Two community efforts — PIDINST for instruments and IGSN for samples — are now closing those gaps. This article explains both and where they fit, drawing on the persistent identifiers domain of the CASRAI Dictionary.

    Why instruments and samples need identifiers

    Consider a measurement. To interpret it properly — to reproduce it, to compare it with another, to assess its reliability — you need to know what produced it: which spectrometer, which sensor, which sequencer, in what configuration and with what calibration history. And to know what the measurement is of, you need to identify the physical sample: which rock core, which water sample, which tissue specimen, collected where and when. Traditionally this provenance was described in prose, in ways that were inconsistent between papers and impossible to resolve automatically. Two papers might use the same instrument or analyse splits of the same sample without any way to know it. Persistent identifiers for instruments and samples make that provenance explicit, resolvable and connectable to the rest of the PID graph.

    PIDINST: persistent identifiers for instruments

    PIDINST is a community framework, developed under the auspices of the Research Data Alliance, for assigning persistent identifiers to research instruments and describing them with a shared metadata schema. The idea is that a significant instrument — a telescope, a mass spectrometer, a research vessel’s sensor array — receives a persistent identifier and a structured description covering attributes such as its owner, manufacturer, model, and where it is located or operated. Once an instrument has a resolvable identifier, data it produces can cite it, the instrument can be linked to the people and institutions responsible for it, and its outputs can be aggregated across studies. PIDINST is deliberately infrastructure-agnostic: it defines the metadata and the principle of persistent identification rather than mandating a single issuing body, allowing existing identifier systems to carry instrument PIDs.
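
    To make the idea of a structured instrument description concrete, here is a minimal sketch of such a record. It is a simplification built from the attributes named above (owner, manufacturer, model, location); the class and field names are assumptions for illustration and are not the property names of the PIDINST schema itself.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class InstrumentRecord:
    """Simplified instrument description; not the actual PIDINST schema."""
    identifier: str
    name: str
    owner: str
    manufacturer: str
    model: str
    location: str = ""
    related_identifiers: List[str] = field(default_factory=list)


def absent_attributes(record: InstrumentRecord) -> List[str]:
    """List the descriptive attributes that were left empty."""
    return [key for key, value in asdict(record).items() if value in ("", [])]


# Every value below is a made-up placeholder.
instrument = InstrumentRecord(
    identifier="10.5555/instrument-0001",
    name="Example mass spectrometer",
    owner="Example University",
    manufacturer="Example Instruments Ltd",
    model="MS-1000",
)
print(absent_attributes(instrument))  # ['location', 'related_identifiers']
```

    Keeping the record independent of any one issuing body mirrors the infrastructure-agnostic stance described above: the same description could travel with whichever identifier system carries the PID.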

    IGSN: identifiers for physical samples

    On the samples side, the IGSN — originally the International Geo Sample Number, now stewarded as a global sample identifier — provides persistent, resolvable identifiers for physical specimens. An IGSN identifies a particular sample: a sediment core, a mineral specimen, a biological sample, with metadata describing what it is, where and when it was collected, and how it relates to parent samples and sub-samples. This last point matters enormously in practice, because samples are routinely split, sub-sampled and distributed; IGSN can express the relationships between a parent sample and its derivatives, so that analyses performed on different splits can be traced back to a common origin. The IGSN system has been integrated with the DataCite infrastructure, aligning sample identifiers with the same resolution and metadata ecosystem used for datasets — which means a sample can be cited and linked just as a dataset can.
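
    The parent and sub-sample relationships can be pictured as a simple lookup from each sample to its parent. The sketch below walks that lookup to find a sample's origin and to test whether two splits share one. The sample labels are invented placeholders rather than registered IGSN IDs, and the dictionary is an assumed stand-in for whatever relationship metadata a real registry exposes.

```python
from typing import Dict, List, Optional

# Maps each sample to its parent; None marks an original, unsplit sample.
PARENT_OF: Dict[str, Optional[str]] = {
    "CORE-001": None,
    "CORE-001-A": "CORE-001",
    "CORE-001-B": "CORE-001",
    "CORE-001-A-1": "CORE-001-A",
}


def lineage(sample_id: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain from a sample up to its original parent."""
    chain = [sample_id]
    seen = {sample_id}
    while parent_of.get(chain[-1]) is not None:
        parent = parent_of[chain[-1]]
        if parent in seen:  # guard against accidental cycles in the records
            raise ValueError(f"cycle detected at {parent}")
        chain.append(parent)
        seen.add(parent)
    return chain


def share_origin(a: str, b: str, parent_of: Dict[str, Optional[str]]) -> bool:
    """True when two samples trace back to the same original sample."""
    return lineage(a, parent_of)[-1] == lineage(b, parent_of)[-1]


print(lineage("CORE-001-A-1", PARENT_OF))  # ['CORE-001-A-1', 'CORE-001-A', 'CORE-001']
print(share_origin("CORE-001-A-1", "CORE-001-B", PARENT_OF))  # True
```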

    A note on RRIDs

    Related to the question of identifying research resources are Research Resource Identifiers (RRIDs), which identify key biological resources used in research — antibodies, cell lines, model organisms, and software tools — so that the exact resource behind a result can be unambiguously named and found. RRIDs address a different layer from PIDINST and IGSN: not the instrument that measured or the unique physical specimen, but the catalogued, often commercially available resources whose precise identity is essential to reproducibility. Together, instrument PIDs, sample identifiers and resource identifiers fill in the parts of the provenance picture that dataset and article DOIs never reached.

    Completing the provenance chain

    The power of these identifiers is realised when they are connected. Picture a fully linked record: a dataset (DOI) was produced by an instrument (PIDINST) operated by a researcher (ORCID) at an institution (ROR), measuring a sample (IGSN) collected on a particular expedition, using a reagent identified by an RRID, all under a grant (grant ID). Each link is resolvable; the whole forms a provenance chain that a machine can traverse and a human can audit. That is a qualitatively better basis for reproducibility and reuse than a methods section written in prose, because every node can be verified against an authoritative record rather than taken on trust.
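
    One way to read "each link is resolvable" is that every identifier can be turned into a URL that leads to an authoritative record. The sketch below does this for three of the schemes, on the assumption that IGSN IDs registered through DataCite resolve through the DOI resolver like other DataCite DOIs. The function name and all identifier values are hypothetical, and schemes without an entry in the table are reported as unconfigured rather than guessed at.

```python
from typing import Dict

# Resolver prefixes; IGSN IDs are treated as DOIs (an assumption, see above).
RESOLVERS: Dict[str, str] = {
    "doi": "https://doi.org/",
    "igsn": "https://doi.org/",
    "orcid": "https://orcid.org/",
    "ror": "https://ror.org/",
}


def to_url(scheme: str, value: str) -> str:
    """Build a resolver URL for an identifier, or fail loudly if unsupported."""
    prefix = RESOLVERS.get(scheme.lower())
    if prefix is None:
        raise KeyError(f"no resolver configured for scheme {scheme!r}")
    return prefix + value


# Placeholder identifiers, invented for this example.
links = [
    ("doi", "10.5555/dataset-0001"),
    ("igsn", "10.5555/sample-0001"),
    ("orcid", "0000-0000-0000-0000"),
]
for scheme, value in links:
    print(to_url(scheme, value))
```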

    Using them in practice

    For researchers, adopting these identifiers is becoming more straightforward as repositories and data-collection workflows build in support. The practical advice is to assign and cite instrument and sample identifiers at the point of data creation rather than retrofitting them later, and to record the relationships — instrument to data, parent sample to sub-sample — while they are still known. Our guidance on persistent identifiers for authors covers how to incorporate these into the research record, and the consistent definitions that let an instrument PID or sample identifier mean the same thing across systems are maintained in the CASRAI Dictionary. As with people and outputs, recognising the contributions of those who build and steward instruments and sample collections is part of a complete record, and structured contribution through the CRediT taxonomy helps make that work visible too.

  • The five-PID stack: ORCID, ROR, RAiD, DOI and IGSN working together

    Persistent identifiers are often introduced one at a time (here is ORCID for researchers, here is the DOI for publications), as if each solved its own isolated problem. That framing undersells them badly. The real power of persistent identifiers is not in any single one but in how they interlock. Each identifies one kind of entity; together they form a connected graph of the research enterprise. This article looks at five identifiers that, taken as a stack, cover the core entities of research (people, organisations, projects, outputs, and physical samples) and shows how they fit together. It draws on the persistent-identifiers domain of the CASRAI Dictionary.

    One identifier per kind of thing

    The organising insight is that each major identifier answers a different question. Get the right identifier for each kind of entity, and the entities can be linked unambiguously.

    • ORCID iD identifies a person — an individual researcher — with a persistent identifier issued by ORCID. It answers “who?” and resolves the perennial problem of name ambiguity, where one researcher’s work is scattered across spelling variants and shared surnames.
    • ROR ID identifies an organisation — a research institution — through the Research Organization Registry. It answers “where?” and collapses the dozens of ways a single university’s name can be written into one canonical, resolvable identifier.
    • RAiD identifies a project — a research activity — under the ISO 23527:2022 standard. It answers “what undertaking?” and gives the connecting activity its own identity rather than leaving it implicit in the outputs.
    • DOI identifies an output — a publication, dataset, or piece of software. Issued through registration agencies such as Crossref (for publications) and DataCite (for data and software), it answers “what was produced?”
    • IGSN identifies a physical sample. Originally the International Geo Sample Number, it is now governed within the DataCite ecosystem. It answers “which specimen?” and extends persistent identification from the digital world to the physical materials that research is done on.

    Five identifiers, five kinds of entity: person, organisation, project, output, sample. Between them they cover the entities that nearly every piece of research involves.

    How they interlock

    The value appears when the identifiers reference one another. Consider a single field campaign. A team of researchers, each with an ORCID iD, based at institutions each with a ROR ID, conducts a project with a RAiD. In the field they collect rock samples, each registered with an IGSN. They analyse the samples and publish a paper and a dataset, each with a DOI. Now watch the connections: the paper’s DOI metadata lists the authors by ORCID and their affiliations by ROR; the dataset’s DOI references the IGSNs of the samples it describes; both outputs link to the project’s RAiD; and the RAiD record, in turn, aggregates the people, the institutions, the samples, and the outputs.

    The result is a graph in which you can start from any node and traverse to the others. From a sample’s IGSN you can reach the dataset that measured it, the paper that interpreted it, the project that collected it, the people who did the work, and the institutions they belong to — all by following identifier references, with no name-matching guesswork. This is the PID graph: the network of relationships formed by linking persistent identifiers, and the substrate on which automated systems can reason across the research enterprise.
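
    The traversal described here amounts to a breadth-first walk over identifier links. The sketch below builds a tiny graph for the field-campaign example and lists everything reachable from one sample. The node labels are invented placeholders with a scheme prefix added for readability; a real PID graph would be assembled from the metadata held by the registries rather than from a hand-written edge list.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

# Links between identifiers, treated as undirected for traversal.
EDGES: List[Tuple[str, str]] = [
    ("igsn:sample-1", "doi:dataset-1"),
    ("doi:dataset-1", "raid:project-1"),
    ("doi:paper-1", "raid:project-1"),
    ("doi:paper-1", "orcid:researcher-1"),
    ("orcid:researcher-1", "ror:institution-1"),
]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Turn an edge list into an adjacency map, adding both directions."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph: Dict[str, Set[str]], start: str) -> List[str]:
    """Breadth-first traversal; returns nodes in order of discovery."""
    seen = {start}
    queue = deque([start])
    order: List[str] = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


graph = build_graph(EDGES)
for node in reachable(graph, "igsn:sample-1"):
    print(node)
```

    Starting from the paper's DOI or the project's RAiD instead reaches the same six nodes, which is the sense in which any node can serve as an entry point.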

    Why the stack beats any single identifier

    Any one of these identifiers is useful on its own, but each has a ceiling that only the stack removes. A DOI tells you an output exists, but matching its authors to real people requires ORCID; matching its affiliations to real institutions requires ROR; placing it in the context of a project requires RAiD; and connecting it to the physical materials behind it requires IGSN. The DOI’s metadata is only as connected as the identifiers it can reference. The same is true of every identifier in the set: each becomes dramatically more powerful when the entities it points to are themselves identified.

    This is why the maturation of the whole ecosystem, rather than any single scheme, has been the significant development of recent years. ROR gained broad adoption and gave organisations a clean identifier; RAiD became an ISO standard and filled the project-shaped hole in the middle of the graph; IGSN moved into the DataCite ecosystem and aligned physical-sample identification with digital-output identification. The pieces stopped being five separate good ideas and started being one connected fabric.

    The supporting cast

    The five-PID stack is the core, but it does not stand entirely alone, and it is worth knowing the adjacent identifiers it connects to. Software Heritage IDs (SWHIDs) pin exact source-code states, complementing the DataCite DOIs that make software citable. The Crossref Funder Registry and Crossref grant IDs identify funders and individual awards, so that the funding behind a project’s RAiD is itself identified. DMP IDs identify data-management plans. These extend the graph further into the lifecycle, but the five core identifiers are the ones that cover the entities every project has.

    Where the dictionary fits

    Most research administrators do not yet hold a clear mental model of how these identifiers fit together — which is the single most common gap the persistent-identifier ecosystem now presents. The schemes are mature; the understanding is not. A dictionary that defines each identifier operationally, makes its relationships explicit, and shows how the stack interlocks is exactly the integrative reference the ecosystem is missing. Providing that map — and federating each definition back to its authoritative steward, from ORCID to DataCite to ARDC — is the role the CASRAI dictionary is built to play.

    What to do now

    For researchers: register for an ORCID iD, ensure your institution’s ROR ID is used on your outputs, mint DOIs for your datasets and software as well as your papers, and use IGSNs for physical samples where your discipline supports them. For institutions: drive identifier coverage across all five entity types, because the graph is only as connected as its sparsest identifier. For the ecosystem: keep federating, so that an identifier minted in one scheme can reference an identifier in another without friction.
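
    The point about the sparsest identifier can be expressed as a small calculation: compute the share of entities of each type that carry an identifier, then look at the minimum. The counts below are invented for illustration, and the function names are hypothetical.

```python
from typing import Dict


def coverage(identified: Dict[str, int], total: Dict[str, int]) -> Dict[str, float]:
    """Share of entities of each kind that carry a persistent identifier."""
    return {
        kind: identified.get(kind, 0) / count
        for kind, count in total.items()
        if count > 0
    }


def sparsest(cov: Dict[str, float]) -> str:
    """The entity kind with the lowest identifier coverage."""
    return min(cov, key=cov.get)


# Made-up counts for one institution.
total = {"person": 100, "organisation": 100, "project": 100, "output": 100, "sample": 100}
identified = {"person": 90, "organisation": 100, "project": 20, "output": 80, "sample": 50}

cov = coverage(identified, total)
print(sparsest(cov), cov[sparsest(cov)])  # project 0.2
```

    In this invented case, effort spent on project identifiers would do more for connectivity than pushing person coverage from 90 to 95 per cent.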
