  • System cards, benchmarks and evaluation suites as first-class research outputs

    A previous article in this series made the case for model cards and datasheets as documented research outputs. That is the well-established end of the spectrum. A wider and faster-moving set of AI/ML artefacts is now emerging as citable scholarship: system cards, evaluation benchmarks, and the evaluation suites that run against them. These are less settled than model cards, but they are arguably more consequential for how the field measures its own progress, and they belong in the conversation about what counts as a research output. CASRAI’s AI/ML research outputs domain treats them as first-class entries.

    From model cards to system cards

    A model card describes a model. A system card describes a deployed system — which is usually a model plus a great deal else: input and output filters, retrieval components, tool-use scaffolding, safety mitigations, usage policies, and the human processes wrapped around the whole thing. The distinction matters because the behaviour a user encounters is a property of the system, not the model in isolation.

    System cards became prominent as the major model developers began publishing them alongside frontier model releases, documenting not just capabilities and evaluations but the risk assessments, red-teaming exercises, and mitigation decisions that shaped the deployment. A system card typically records what the system can do, what failure modes were identified, what was done to mitigate them, and what residual risks remain. It is closer in spirit to a safety case than to a specification sheet.

    For the research record, the significance is that a system card documents decisions, not just facts. It is the contemporaneous account of how a team reasoned about deploying a powerful artefact responsibly. That reasoning is itself a research contribution — and, when the system later behaves in unexpected ways, the system card is the record against which those decisions can be assessed.

    Benchmarks as scholarly infrastructure

    An evaluation benchmark is a standardised dataset paired with a scoring protocol, used to measure and compare the capability of models on a defined task. Benchmarks are not new — they have organised whole subfields of machine learning for decades — but their status as research outputs has historically been ambiguous. A benchmark is often introduced in a paper, but the artefact that matters and gets reused for years afterwards is the dataset-and-protocol, not the paper describing it.

    Benchmarks are among the highest-leverage outputs in the field. A widely adopted benchmark shapes what thousands of researchers optimise for; a flawed one steers the field toward the wrong objective. Building a good benchmark (defining the task crisply, assembling a representative dataset, designing a scoring procedure that resists gaming, and documenting all of it) is substantial scholarly work. It deserves to be cited and credited as such, not buried as the apparatus of a single paper.

    Evaluation suites and the contamination problem

    An evaluation suite bundles multiple benchmarks and tasks together with the harness that runs them, so that a model release can be assessed consistently and reproducibly. The suite is the executable counterpart to the benchmark: where a benchmark is the standard, the suite is the instrument that applies it.
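
    The relationship between benchmark and suite can be sketched in a few lines of Python. This is an illustrative toy, not the design of any particular harness: the names `Benchmark`, `exact_match`, and `run_suite` are hypothetical, and a "model" is assumed to be nothing more than a function from prompt to answer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumption for illustration: a model is a plain function from prompt to answer.
Model = Callable[[str], str]


@dataclass
class Benchmark:
    """A benchmark: a dataset paired with a scoring protocol."""
    name: str
    version: str
    items: List[Tuple[str, str]]         # (prompt, reference answer) pairs
    score: Callable[[str, str], float]   # scoring protocol for one item


def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when prediction and reference agree, ignoring case and outer whitespace."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def run_suite(model: Model, suite: List[Benchmark]) -> Dict[str, float]:
    """Apply every benchmark in the suite to the model and return mean scores.

    Results are keyed by "name@version" so each score names the version it
    was measured against.
    """
    results: Dict[str, float] = {}
    for bench in suite:
        if not bench.items:
            raise ValueError(f"benchmark {bench.name!r} has no items")
        scores = [bench.score(model(prompt), ref) for prompt, ref in bench.items]
        results[f"{bench.name}@{bench.version}"] = sum(scores) / len(scores)
    return results


# Usage with a toy benchmark and a lookup-table "model".
toy = Benchmark("toy-arith", "1.0.0", [("2+2", "4"), ("3+3", "6")], exact_match)
answers = {"2+2": "4", "3+3": "7"}
results = run_suite(lambda prompt: answers[prompt], [toy])
```

    Keying each result by name and version is a deliberate choice in this sketch: the score is only interpretable alongside the version of the benchmark that produced it.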

    Treating evaluation suites as research outputs surfaces a problem the field takes seriously: test set contamination. When benchmark data leaks into the training corpus of the models being evaluated, the resulting scores measure memorisation rather than capability, and the benchmark silently loses its validity. Documenting an evaluation suite as a versioned, identifier-bearing output, with a clear record of which data is held out and when it was published, is part of how the community defends against contamination. A benchmark whose provenance and release date are pinned by a persistent identifier can be checked against a model’s training cut-off: items first published after the cut-off cannot have entered the training corpus through the public release, while items published before it need closer scrutiny. An undocumented benchmark supports no such check.
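
    The date comparison described here is simple enough to sketch. The following assumes each benchmark version carries a recorded release date and that the model's training cut-off is known; the class and function names, and the placeholder identifiers, are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class BenchmarkVersion:
    identifier: str   # placeholder standing in for a version DOI
    released: date    # date the benchmark data first became public


def released_after_cutoff(benchmark: BenchmarkVersion, training_cutoff: date) -> bool:
    """True when the benchmark was first published strictly after the cut-off."""
    return benchmark.released > training_cutoff


def screen(benchmarks: List[BenchmarkVersion], training_cutoff: date) -> Dict[str, str]:
    """Label each benchmark version by whether its release date clears the cut-off."""
    labels: Dict[str, str] = {}
    for bench in benchmarks:
        if released_after_cutoff(bench, training_cutoff):
            labels[bench.identifier] = "released after cut-off"
        else:
            labels[bench.identifier] = "needs contamination check"
    return labels


# Usage with made-up identifiers and dates.
cutoff = date(2024, 6, 30)
report = screen(
    [
        BenchmarkVersion("example-benchmark-v1", date(2023, 1, 15)),
        BenchmarkVersion("example-benchmark-v2", date(2024, 9, 1)),
    ],
    cutoff,
)
```

    This is only a first screen. A release date after the cut-off says nothing about items that circulated earlier in other forms, and a date before it does not prove leakage; checks for text overlap against the training corpus are a separate step.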

    Why first-class status matters

    The recurring theme across these artefact types is that the thing that gets reused is not the paper. It is the system card, the benchmark dataset, the evaluation harness. When these are treated as mere supplementary material, three things go wrong:

    • Credit is misallocated. The team that built a benchmark used by the whole field gets a single citation to a paper, while the artefact’s real influence is invisible to any contribution-based assessment.
    • Versioning is lost. Benchmarks and suites evolve — tasks are added, errors corrected, contaminated items removed. Without identifiers and versioning, it becomes impossible to say which version of a benchmark a given result was measured against.
    • Provenance breaks. The contamination defence above depends entirely on knowing exactly what a benchmark contained and when it was released. That is metadata, and metadata needs an identifier to hang on.

    How persistent identifiers and CRediT apply

    The infrastructure pattern mirrors that for model cards and datasets. A benchmark or evaluation suite is deposited in a repository and receives a DataCite DOI; each substantive revision receives its own version DOI, so that results can cite the exact version they were measured against. The executable harness is pinned by a Software Heritage persistent identifier (SWHID). Contributors carry ORCID iDs, institutions carry ROR IDs, and where the work belongs to a larger programme a RAiD ties it together. CASRAI’s persistent-identifier guidance covers the deposit pattern.
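
    As a rough illustration of what such a deposit record might hold, the sketch below gathers the identifiers for one benchmark version and runs basic format checks on them. The `DepositRecord` class and its field names are hypothetical, and the DOI and SWHID values in the usage example are placeholders. The ORCID check follows the ISO 7064 MOD 11-2 checksum that ORCID documents for its iDs, and the SWHID pattern covers only core identifiers without qualifiers.

```python
import re
from dataclasses import dataclass
from typing import List

# Core SWHID form: swh:1:<object type>:<40 lowercase hex digits>.
SWHID_PATTERN = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}$")
ORCID_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$")


def orcid_checksum_ok(orcid: str) -> bool:
    """Validate an ORCID iD's final character with ISO 7064 MOD 11-2."""
    if not ORCID_PATTERN.match(orcid):
        return False
    chars = orcid.replace("-", "")
    total = 0
    for ch in chars[:15]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[15] == expected


@dataclass
class DepositRecord:
    """Hypothetical record of identifiers for one benchmark version."""
    title: str
    version: str
    version_doi: str
    harness_swhid: str
    contributor_orcids: List[str]

    def problems(self) -> List[str]:
        """Return a list of format problems; an empty list means none were found."""
        issues: List[str] = []
        if not self.version_doi.startswith("10."):
            issues.append("version_doi does not look like a DOI")
        if not SWHID_PATTERN.match(self.harness_swhid):
            issues.append("harness_swhid is not a core SWHID")
        for orcid in self.contributor_orcids:
            if not orcid_checksum_ok(orcid):
                issues.append(f"ORCID iD fails checksum: {orcid}")
        return issues


# Usage: the DOI and SWHID are placeholders; the ORCID iD is the sample iD
# used in ORCID's own documentation.
record = DepositRecord(
    title="Example benchmark",
    version="2.0.0",
    version_doi="10.0000/example-benchmark.v2",
    harness_swhid="swh:1:rev:" + "0" * 40,
    contributor_orcids=["0000-0002-1825-0097"],
)
issues = record.problems()
```

    Format checks like these only catch malformed identifiers. Whether a DOI or SWHID actually resolves to the intended artefact has to be confirmed against the registries themselves.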

    Contributorship maps onto CRediT with the same caveats discussed for models and datasets. Designing the benchmark task is Conceptualization and Methodology; assembling and annotating the data is Investigation and Data curation; building the harness is Software; running and verifying the evaluation is Validation. As with software papers, the Software role tends to be overloaded — a known limitation of applying CRediT to code-heavy outputs — but the statement is still more informative than an undifferentiated author list.
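
    The mapping in the paragraph above can be written down as a lookup table that turns a list of activities into a CRediT statement. The role names are CRediT's own; the activity labels, the `credit_statement` function, and the contributor names are made up for illustration.

```python
from typing import Dict, List

# Activity-to-role mapping as described in the text. The activity labels
# are illustrative; the role names come from the CRediT taxonomy.
CREDIT_ROLES_BY_ACTIVITY: Dict[str, List[str]] = {
    "design benchmark task": ["Conceptualization", "Methodology"],
    "assemble and annotate data": ["Investigation", "Data curation"],
    "build harness": ["Software"],
    "run and verify evaluation": ["Validation"],
}


def credit_statement(contributions: Dict[str, List[str]]) -> str:
    """Render one line per contributor listing their CRediT roles.

    Roles appear in the order their activities are listed, without duplicates.
    """
    lines: List[str] = []
    for person, activities in contributions.items():
        roles: List[str] = []
        for activity in activities:
            for role in CREDIT_ROLES_BY_ACTIVITY[activity]:
                if role not in roles:
                    roles.append(role)
        lines.append(f"{person}: {', '.join(roles)}")
    return "\n".join(lines)


# Usage with placeholder contributor names.
statement = credit_statement(
    {
        "A. Researcher": ["design benchmark task", "build harness"],
        "B. Annotator": ["assemble and annotate data"],
    }
)
```

    Note that everything code-related collapses into the single Software role here, which is the overloading the paragraph describes.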

    What to do now

    For researchers: deposit benchmarks and evaluation suites as versioned outputs with DataCite DOIs and Software Heritage identifiers (SWHIDs), document them with the rigour of a datasheet, record the release date and held-out contents explicitly, and attach a CRediT statement. For funders and institutions: recognise system cards, benchmarks, and evaluation suites in CRIS systems and assessment as the high-leverage outputs they are, rather than as apparatus. For the field’s collective health, the benchmark you cite should be one you can name a version of.
