publication harvest – CASRAI Dictionary

Most universities run a system that quietly underpins much of their research administration, yet most researchers could not name it. It is the Current Research Information System (CRIS): the institutional backbone that ties together who the researchers are, what projects they run, who funds them, and what they produce. This article gives a plain-language account of what a CRIS does, why it matters, and why it depends so heavily on shared vocabulary, using the standard terms of the research-information systems domain.

CRIS and RIM: the system and the function

Two terms travel together and are easily confused. A CRIS is the software system. Research Information Management (RIM) is the broader discipline and practice of managing research information — the function that the CRIS supports. RIM is what a research office does; the CRIS is the tool it uses to do it. Both terms appear because the same activity is described from two angles: the operational system and the professional practice. Familiar CRIS products include Pure, Symplectic Elements, Converis, Worktribe, and the open-source VIVO and DSpace-CRIS.

What a CRIS actually holds

A CRIS is, at heart, a set of connected records about a handful of entity types and the relationships between them. The core entities are people, organisational units, projects, funding, and outputs. The value is in the connections: this researcher, in this department, leads this project, funded by this award, which produced these publications and datasets. Each entity is a record; the CRIS is the graph that joins them.
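The "connected records" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, field names, and sample records are hypothetical and are not taken from any CRIS product. It shows how one question ("what did this award produce?") is answered by following links between records.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    person_id: str
    name: str
    unit_id: str  # link to an organisational unit


@dataclass
class Project:
    project_id: str
    title: str
    lead_id: str  # link to a Person
    award_ids: list = field(default_factory=list)  # links to funding awards


@dataclass
class Output:
    output_id: str
    title: str
    project_id: str  # link to the Project that produced it


def outputs_for_award(award_id, projects, outputs):
    """Follow award -> project -> output links and return matching titles."""
    project_ids = {p.project_id for p in projects if award_id in p.award_ids}
    return [o.title for o in outputs if o.project_id in project_ids]


# Hypothetical sample records.
people = [Person("p1", "A. Researcher", "eng")]
projects = [Project("pr1", "Bridge sensors", "p1", ["aw1"])]
outputs = [
    Output("o1", "Sensor paper", "pr1"),
    Output("o2", "Unrelated dataset", "pr2"),
]
award_titles = outputs_for_award("aw1", projects, outputs)
```

Here `award_titles` contains only the output linked, via its project, to award `aw1`. A production system would typically hold these links in a relational or graph database, but the traversal is the same in principle.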

The researcher profile is the entity most people encounter. It aggregates a person’s affiliations, outputs, projects, and activities into a single record — the thing that often surfaces as a public staff page. Behind it sits an organisational hierarchy: the structured representation of departments, schools, institutes, and centres, so that the system can roll outputs and funding up to any level of the institution. The quality of that hierarchy determines whether “how much did the School of Engineering publish last year?” is a one-click query or a week of manual work.
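A roll-up over the organisational hierarchy might look like the following minimal sketch. The unit names and output counts are invented for illustration, and the sketch assumes each output is recorded against exactly one unit, so nothing is counted twice.

```python
# Hypothetical unit tree: child unit -> parent unit (None marks the root).
PARENT = {
    "university": None,
    "school-of-engineering": "university",
    "dept-civil": "school-of-engineering",
    "dept-mech": "school-of-engineering",
    "school-of-arts": "university",
}

# Hypothetical count of outputs recorded directly against each unit.
DIRECT_OUTPUTS = {
    "school-of-engineering": 3,
    "dept-civil": 40,
    "dept-mech": 25,
    "school-of-arts": 12,
}


def rolled_up_outputs(unit):
    """Count outputs for a unit plus all of its descendant units."""
    total = DIRECT_OUTPUTS.get(unit, 0)
    for child, parent in PARENT.items():
        if parent == unit:
            total += rolled_up_outputs(child)
    return total


engineering_total = rolled_up_outputs("school-of-engineering")
university_total = rolled_up_outputs("university")
```

With a clean tree, the School of Engineering figure is a single recursive call. If departments are missing from the tree or attached to the wrong parent, the same call silently returns the wrong number, which is why hierarchy quality matters so much.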

The core job: getting data in

A CRIS is only as useful as the data in it, and the central operational challenge is keeping that data current without burying researchers in data entry. Two mechanisms do most of the work. A publication harvest automatically imports publication metadata from external sources — Crossref, Scopus, Web of Science, PubMed, ORCID — so that a researcher’s output list populates itself rather than being typed in. A funder ingest imports funding and award metadata, so that grants appear against the right people and projects.
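Because the same paper usually arrives from several sources, a harvest has to merge records rather than simply append them. The sketch below assumes the records have already been fetched and reduced to a `doi` and a `title`; the record shape, the sample DOIs, and the function names are hypothetical. It relies on one fact: DOIs are case-insensitive, so they are compared in lower case.

```python
def normalise_doi(doi):
    """Lower-case a DOI and strip a resolver or 'doi:' prefix if present."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def harvest(batches):
    """Merge records from several sources, keeping one record per DOI."""
    merged = {}
    for source, records in batches.items():
        for rec in records:
            key = normalise_doi(rec["doi"])
            entry = merged.setdefault(
                key, {"doi": key, "title": rec["title"], "sources": []}
            )
            entry["sources"].append(source)
    return list(merged.values())


# Hypothetical, already-fetched batches keyed by source name.
batches = {
    "crossref": [{"doi": "10.1234/ABC.1", "title": "Sensor paper"}],
    "pubmed": [{"doi": "https://doi.org/10.1234/abc.1", "title": "Sensor paper"}],
    "scopus": [{"doi": "10.1234/xyz.9", "title": "Another paper"}],
}
harvested = harvest(batches)
```

Three incoming records become two outputs, and the first keeps a note of both sources that reported it. Records that arrive without a DOI need a separate matching strategy, which this sketch does not attempt.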

Neither mechanism is reliable without identifiers. A publication harvest that matches on author name alone will mis-assign work among researchers who share a name, and often among those who share only a surname and initial; matching on ORCID iD resolves the person unambiguously. A funder ingest that matches on institution name can fragment one university across a dozen spelling variants; matching on ROR ID collapses them to one. This is why the maturation of the persistent-identifier ecosystem has done more for CRIS data quality than any feature in the software itself.
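The difference between name matching and identifier matching is easy to demonstrate. In this sketch the researcher records, award records, ORCID iDs, and ROR IDs are all placeholders invented for illustration (they are not real identifiers), and the function names are hypothetical.

```python
# Placeholder records; the ORCID iDs below are made up, not real iDs.
RESEARCHERS = [
    {"id": "p1", "name": "J. Smith", "orcid": "0000-0000-0000-0001"},
    {"id": "p2", "name": "J. Smith", "orcid": "0000-0000-0000-0002"},
]


def match_by_name(author_name):
    """Return every researcher whose name equals the author string."""
    return [r["id"] for r in RESEARCHERS if r["name"] == author_name]


def match_by_orcid(orcid):
    """Return every researcher holding the given ORCID iD."""
    return [r["id"] for r in RESEARCHERS if r["orcid"] == orcid]


name_hits = match_by_name("J. Smith")               # ambiguous: two candidates
orcid_hits = match_by_orcid("0000-0000-0000-0002")  # exactly one candidate

# Placeholder award records; the ROR ID is made up as well.
AWARDS = [
    {"award": "A-1", "org_name": "Univ. of Example", "ror": "https://ror.org/00example1"},
    {"award": "A-2", "org_name": "University of Example", "ror": "https://ror.org/00example1"},
    {"award": "A-3", "org_name": "Example University", "ror": "https://ror.org/00example1"},
]
orgs_by_name = {a["org_name"] for a in AWARDS}  # three apparent organisations
orgs_by_ror = {a["ror"] for a in AWARDS}        # one actual organisation
```

Name matching leaves the harvest with two candidates and no way to choose between them, while the identifier match returns one. The same holds for organisations: three name variants collapse to a single ROR ID.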

Disambiguation, enrichment, validation

Three less-visible activities determine whether a CRIS is trusted. Disambiguation is the process of resolving ambiguous identifications — two authors with the same name, two spellings of one organisation — to canonical entities. Enriched metadata is metadata improved with information from external sources: adding Crossref Funder Registry IDs to funding records, adding ROR IDs to affiliations, adding DOIs to outputs that arrived without them. A validation rule is a check applied during ingest to enforce data quality — rejecting a publication record with no identifier, flagging an award whose dates fall outside its project. Together these turn a heap of imported records into a research-information asset an institution can report from with confidence.
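The two validation examples and one of the enrichment examples above can be sketched as small functions applied during ingest. Everything here is an assumption made for illustration: the field names (`doi`, `pmid`, `start`, `end`, `ror`), the lookup table, and the placeholder ROR ID are hypothetical, and a real CRIS would express such rules in its own configuration.

```python
from datetime import date


def validate_publication(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not (record.get("doi") or record.get("pmid")):
        problems.append("no persistent identifier")
    return problems


def validate_award(award, project):
    """Flag an award whose dates fall outside its project's dates."""
    problems = []
    if award["start"] < project["start"] or award["end"] > project["end"]:
        problems.append("award dates fall outside project dates")
    return problems


# Hypothetical lookup from a normalised organisation name to a ROR ID.
ROR_LOOKUP = {"university of example": "https://ror.org/00example1"}


def enrich_affiliation(affiliation):
    """Return a copy of the affiliation with a ROR ID added when one is known."""
    enriched = dict(affiliation)
    if "ror" not in enriched:
        ror = ROR_LOOKUP.get(enriched["name"].strip().lower())
        if ror:
            enriched["ror"] = ror
    return enriched


project = {"start": date(2023, 1, 1), "end": date(2025, 12, 31)}
late_award = {"start": date(2023, 1, 1), "end": date(2026, 6, 30)}

publication_problems = validate_publication({"title": "Untitled draft"})
award_problems = validate_award(late_award, project)
enriched = enrich_affiliation({"name": " University of Example "})
```

Returning a list of problems rather than raising an error lets the ingest decide per rule whether to reject a record or merely flag it for review, which matches the distinction the text draws between rejecting and flagging.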

What the CRIS is for

The reason institutions invest in a CRIS is that the same research-information facts are needed, repeatedly, for many different purposes. Annual reporting, research assessment exercises, open-access compliance monitoring, public staff and project pages, internal resource allocation, and responses to funder audits all draw on the same underlying entities. Without a CRIS, each of these is a separate data-gathering exercise; with one, they are views over a single maintained graph. The CRIS is the institution’s single source of truth for research information, and its value depends directly on how trustworthy that single source is.

This is also why a CRIS connects outward. It is not an island: it harvests from Crossref and ORCID, it can push validated publications to a repository, it feeds open-access compliance dashboards, and increasingly it exchanges project information using shared models. A modern CRIS is a node in an institutional and sectoral information fabric, not a closed database.

Why shared vocabulary is the precondition

Here is the catch that connects the CRIS to CASRAI’s mission. Every CRIS implementation that invents its own field names — its own way of recording an ethics status, an output type, a project phase, a funding category — creates a system that cannot exchange data cleanly with any other. The harvests work because Crossref, ORCID, and ROR provide shared identifiers and shared metadata. The internal records often do not interoperate, because each institution structured them locally. A controlled, shared vocabulary for the entities and attributes a CRIS holds is what would let research information move between institutions as cleanly as it now moves in from the identifier providers. Supplying that definitional layer is the convening role the CASRAI dictionary exists to play.
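The interoperability problem can be made concrete with a small mapping sketch. The shared terms, the two institutions, and their local labels below are all invented for illustration; they are not actual CASRAI dictionary entries or any product's field names.

```python
# Hypothetical shared vocabulary of output types.
SHARED_OUTPUT_TYPES = {"journal-article", "dataset", "book-chapter"}

# Hypothetical local labels used by two institutions, mapped to shared terms.
LOCAL_TO_SHARED = {
    "inst_a": {"Article (peer reviewed)": "journal-article", "Data set": "dataset"},
    "inst_b": {"JRNL": "journal-article", "CHAP": "book-chapter"},
}


def to_shared(institution, local_type):
    """Translate a local output-type label into the shared vocabulary."""
    shared = LOCAL_TO_SHARED[institution].get(local_type)
    if shared not in SHARED_OUTPUT_TYPES:
        raise ValueError(f"{institution}: no shared term for {local_type!r}")
    return shared


type_a = to_shared("inst_a", "Article (peer reviewed)")
type_b = to_shared("inst_b", "JRNL")
```

Both local labels resolve to the same shared term, so records from the two institutions become comparable. Without an agreed target vocabulary, every pair of institutions would need its own bilateral mapping, and unmapped labels (which raise an error here) would pass through silently.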

What to do now

For institutions running a CRIS: invest in identifiers first, because ORCID and ROR adoption do more for data quality than any software feature. Treat disambiguation, enrichment, and validation as ongoing operations, not one-off projects. For those procuring or integrating systems: use vendor-neutral, shared vocabulary to specify what you need, so the conversation is about your requirements rather than one product’s field names.

What a CRIS does: the research-information backbone explained