Category: Guides & Explainers

Practical how-to guides, templates, checklists, and career pathways for research administrators, authors, and institutional teams.

Randomised Controlled Trials: The Gold Standard Explained

A randomised controlled trial (RCT) is an experimental study in which participants are allocated to an intervention group or a comparison group purely by chance, so that the only systematic difference between groups is the treatment under test. By combining randomisation, a control or comparison arm and, where possible, blinding, the RCT isolates the effect of an intervention from confounding factors, making it the methodological gold standard for answering causal questions.

The core insight is simple but powerful: if allocation is genuinely random and groups are large enough, known and unknown confounders tend to balance out across arms, leaving only chance imbalances. Any difference in outcome beyond what chance would produce can then be attributed to the intervention rather than to pre-existing differences between participants.

Randomisation

Randomisation is the process of assigning participants to groups by chance — for example, by computer-generated sequence. Its purpose is to balance characteristics such as age, severity and unmeasured risk factors across arms, removing selection bias from the comparison. Without it, sicker or healthier participants might cluster in one group, distorting the result.
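
As a minimal sketch of what a computer-generated sequence can look like, the Python below builds a permuted-block allocation list. Block randomisation is one common scheme and is an assumption here, not something this guide prescribes; the function name, arm labels, block size and seed are all illustrative.

```python
import random

def permuted_block_sequence(n_participants, arms=("intervention", "control"),
                            block_size=4, seed=2026):
    """Return an allocation list built from shuffled, balanced blocks.

    Hypothetical helper for illustration only; a real trial would rely on a
    validated, access-controlled randomisation service.
    """
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)  # fixed seed so the sequence can be regenerated and audited
    per_arm = block_size // len(arms)
    sequence = []
    while len(sequence) < n_participants:
        block = list(arms) * per_arm  # equal numbers of each arm in every block
        rng.shuffle(block)            # order within the block is left to chance
        sequence.extend(block)
    return sequence[:n_participants]

allocation = permuted_block_sequence(12)
print(allocation)
```

Blocks keep the arms close to equal in size throughout recruitment, which simple coin-flip allocation does not guarantee in small trials.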

Allocation concealment

Allocation concealment ensures that those enrolling participants cannot foresee or influence which group a person will join. It is distinct from blinding: concealment protects the randomisation process at the point of assignment, whereas blinding operates after assignment. Poor concealment is one of the most consistently demonstrated sources of exaggerated treatment effects.

Control and comparison

A control or comparison arm provides the counterfactual — what would have happened without the intervention. Comparators may be a placebo, standard care or an active alternative. The placebo arm in particular controls for expectation effects, a topic explored in our article on the placebo and placebo effect.

Blinding

Blinding (or masking) prevents participants, clinicians or assessors from knowing group assignment, reducing conscious and unconscious bias. The mechanics of single, double and triple blinding, and the specific biases they address, are set out in our companion guide to double-blind studies and bias control.

Intention-to-treat analysis

Intention-to-treat (ITT) analysis evaluates participants in the groups to which they were randomised, regardless of whether they completed the assigned treatment. This preserves the benefits of randomisation and gives a realistic estimate of effectiveness in practice, where adherence is imperfect. The contrasting per-protocol analysis, which includes only those who followed the protocol, can reintroduce bias and is usually treated as secondary.
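
The difference between the two analyses can be illustrated with a toy dataset. The records below are invented for this sketch, and the field names (`arm`, `adhered`, `recovered`) are hypothetical.

```python
# Hypothetical participant records: randomised arm, whether the participant
# followed the protocol, and a binary outcome (1 = recovered).
participants = [
    {"arm": "treatment", "adhered": True,  "recovered": 1},
    {"arm": "treatment", "adhered": True,  "recovered": 1},
    {"arm": "treatment", "adhered": False, "recovered": 0},
    {"arm": "treatment", "adhered": False, "recovered": 0},
    {"arm": "control",   "adhered": True,  "recovered": 1},
    {"arm": "control",   "adhered": True,  "recovered": 0},
    {"arm": "control",   "adhered": True,  "recovered": 0},
    {"arm": "control",   "adhered": True,  "recovered": 1},
]

def recovery_rate(records, arm):
    """Proportion recovered among the records belonging to one arm."""
    outcomes = [r["recovered"] for r in records if r["arm"] == arm]
    return sum(outcomes) / len(outcomes)

# Intention-to-treat: everyone is analysed in the arm they were randomised to.
itt = {arm: recovery_rate(participants, arm) for arm in ("treatment", "control")}

# Per-protocol: non-adherent participants are dropped before the comparison.
adherent = [r for r in participants if r["adhered"]]
per_protocol = {arm: recovery_rate(adherent, arm) for arm in ("treatment", "control")}

print("ITT:", itt)                    # treatment 0.5, control 0.5
print("Per-protocol:", per_protocol)  # treatment 1.0, control 0.5
```

In this made-up data the per-protocol comparison flatters the treatment, because the participants who stopped were also the ones who did not recover.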

Why the RCT is the gold standard

For causal questions about whether an intervention works, the RCT’s design controls the main threats to validity in one structure. It sits at the heart of the confirmatory stage of drug development, as described in our overview of the pharmaceutical R&D pipeline, and underpins evidence-based decision-making across the research lifecycle.

Anatomy of a well-conducted RCT

A robust trial weaves these elements together rather than relying on any single one. The table below summarises the core components and the threat each addresses.

Component	Purpose	Threat addressed
Randomisation	Balance groups by chance	Confounding, selection bias
Allocation concealment	Hide upcoming assignment	Manipulation of enrolment
Control arm	Provide a counterfactual	Mistaking change for effect
Blinding	Conceal group membership	Performance and detection bias
Intention-to-treat	Analyse as randomised	Attrition and post-hoc selection

Power, sample size and pre-specification

Randomisation only balances groups reliably when the sample is large enough, which is why trials specify a target sample size derived from the smallest difference worth detecting. Too small a study may miss a real effect or produce an unstable estimate; an adequately powered one gives the result interpretive weight. Equally important is pre-specifying the primary outcome and analysis plan before the data are seen, so that a single confirmatory test is fixed in advance rather than chosen afterwards. This connects directly to the practice of preregistration and Registered Reports, which protects the trial’s confirmatory status from later analytic flexibility.
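
To make the link between the smallest difference worth detecting and the target sample size concrete, here is a rough sketch using the standard normal-approximation formula for comparing two means. The inputs (a difference of 0.5, a standard deviation of 1.0, 5% two-sided significance, 80% power) are illustrative assumptions; real trials typically use dedicated software and allow for dropout.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Approximate participants per arm for a two-sided comparison of two means.

    Normal approximation: n = 2 * (z_(1 - alpha/2) + z_power)^2 * sd^2 / delta^2
    delta is the smallest difference worth detecting; sd is the assumed common
    standard deviation. Both are illustrative inputs.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * sd ** 2 / delta ** 2
    return math.ceil(n)

print(sample_size_per_arm(delta=0.5, sd=1.0))   # 63
print(sample_size_per_arm(delta=0.25, sd=1.0))  # a smaller target difference needs far more people
```

Halving the difference to be detected roughly quadruples the required sample, which is why the choice of that difference deserves scrutiny before recruitment starts.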

Where the RCT sits in the evidence hierarchy

A single trial, however well conducted, is rarely the final word. Findings gain strength when they are replicated and when multiple RCTs are combined in systematic reviews and meta-analyses, which sit above the individual trial in the evidence hierarchy. Conversely, a well-designed observational study can sometimes be more informative than a flawed or under-powered RCT. The design is a powerful tool, not an automatic guarantee of truth, and its value depends on execution and transparent reporting.

Internal versus external validity

Two distinct questions decide whether a trial is useful. Internal validity asks whether the result is true for the participants studied — whether the design genuinely isolated the intervention’s effect from bias and confounding. External validity asks whether that result generalises to other people, settings and conditions. The RCT excels at the first: randomisation, concealment, control and blinding are precisely the tools that secure internal validity. It is weaker on the second, because the controlled conditions and selected participants that protect internal validity can make a trial less representative of routine practice. Strong evidence requires attention to both, and the two sometimes pull in opposite directions.

Pragmatic versus explanatory trials

This tension has produced two broad trial styles. Explanatory trials test whether an intervention can work under ideal, tightly controlled conditions — maximising internal validity and answering questions of efficacy. Pragmatic trials test whether it does work in everyday clinical settings with broader participants and fewer restrictions — favouring external validity and answering questions of effectiveness. Neither is superior in the abstract; the right choice depends on the question being asked. A regulator confirming a causal effect may want an explanatory design, while a health system deciding whether to adopt a treatment may learn more from a pragmatic one. Reporting which style a trial used helps readers interpret how far its findings should travel.

Limits of the design

RCTs are not universally applicable. They can be expensive, may exclude populations seen in routine practice, and are sometimes unethical or impractical: you cannot randomise people to harmful exposures. Tightly controlled conditions can also limit generalisability, opening a gap between efficacy (does it work under trial conditions?) and effectiveness (does it work in the real world?). Transparent reporting and good documentation, as encouraged in our guidance for authors, help readers judge how far a trial’s findings extend.

Frequently asked questions

What makes randomisation so important?

Randomisation distributes both known and unknown confounders evenly across groups, so that observed differences in outcome can be attributed to the intervention rather than to pre-existing imbalances.

How is allocation concealment different from blinding?

Allocation concealment hides the upcoming assignment from those enrolling participants, protecting the randomisation itself. Blinding hides group membership after assignment to prevent biased behaviour and assessment.

Why use intention-to-treat analysis?

Analysing participants in their assigned groups preserves randomisation and gives a pragmatic estimate of effect under realistic adherence, avoiding bias introduced by excluding non-completers.

When is an RCT not appropriate?

When randomisation would be unethical, impractical or impossible — for example for harmful exposures or rare conditions — observational designs may be the only feasible option, accepting their greater vulnerability to confounding.

June 18, 2026

Large Language Models in Research: An Explainer

A large language model (LLM) is a type of artificial-intelligence model, built on the transformer neural-network architecture, that is trained on very large quantities of text to predict and generate language. At its core, an LLM learns the statistical patterns of language by repeatedly predicting the next token in a sequence; after training on enough text, this simple objective yields a system that can answer questions, summarise, translate and draft prose. Understanding how LLMs work — and where they fail — is now essential for researchers who use or evaluate them.

Transformers and tokens

The transformer, introduced in 2017, is the architecture underlying modern LLMs. Its key innovation is the attention mechanism, which lets the model weigh the relevance of different parts of the input when processing each element, capturing long-range relationships in text efficiently and in parallel. This made it practical to train far larger models than earlier sequence architectures allowed.
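
The weighing of relevance can be sketched in a few lines of plain Python. This is a heavily simplified, single-query version of scaled dot-product attention with made-up two-dimensional vectors; real models use learned projections, many attention heads and matrix operations on specialised hardware.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Each key is scored against the query, the scores become weights,
    and the output is the weighted average of the value vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output

# Toy vectors for three tokens (illustrative numbers only).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend(query=[1.0, 0.0], keys=keys, values=values)
print(weights)
print(output)
```

The first and third tokens receive equal, larger weights because their keys align with the query, while the second token, whose key is orthogonal to it, contributes least.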

LLMs do not read words directly. Text is broken into tokens — units that may be whole words, parts of words or punctuation — and each token is converted into a numerical vector. The model processes sequences of these tokens and predicts the next one, assigning probabilities across its vocabulary. Generation proceeds token by token. Because models have a finite context window, the amount of text they can consider at once is bounded, which matters when working with long documents.
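
Next-token prediction and token-by-token generation can be shown with a toy stand-in for a trained model: a table of which token follows which. This is only an analogy under loud assumptions. Whitespace splitting replaces real subword tokenisation, and a bigram count table replaces a neural network that conditions on the whole context window.

```python
from collections import Counter, defaultdict

corpus = "the trial was randomised . the trial was blinded . the result was clear ."
tokens = corpus.split()  # whitespace tokens; real LLMs use subword tokenisers

# Count which token follows which: a bigram table stands in for a trained model.
following = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    following[current][nxt] += 1

def next_token_probabilities(token):
    """Probability of each candidate next token, given only the previous token."""
    counts = following[token]
    total = sum(counts.values())
    return {candidate: count / total for candidate, count in counts.items()}

def generate(start, max_new_tokens=4):
    """Greedy generation: repeatedly append the most probable next token."""
    output = [start]
    for _ in range(max_new_tokens):
        probs = next_token_probabilities(output[-1])
        if not probs:
            break
        output.append(max(probs, key=probs.get))
    return output

print(next_token_probabilities("the"))  # 'trial' is twice as likely as 'result'
print(generate("the"))
```

The generated text is fluent within the limits of the table but carries no notion of truth, which foreshadows the hallucination problem discussed later in this explainer.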

Pretraining and fine-tuning

LLMs are typically built in two stages. Pretraining exposes the model to a vast, broad corpus, during which it learns general language patterns through next-token prediction; this is the costly, compute-intensive stage. Fine-tuning then adapts the pretrained model to specific tasks or behaviours using smaller, targeted datasets. A widely used alignment step, reinforcement learning from human feedback (RLHF), further tunes models so their responses are more helpful and follow instructions. This two-stage design is why a single pretrained base can be specialised for many downstream uses, connecting LLMs to the broader story of neural networks and deep learning.

Capabilities and limitations

LLMs are capable assistants for drafting, summarising, translating, extracting information and explaining concepts. But their limitations are intrinsic, not incidental, and researchers must keep them in view.

Capability	Corresponding limitation
Fluent, plausible text generation	Hallucination — confident but false statements
Broad knowledge from training data	Knowledge cut-off; no awareness of newer events
Summarising and synthesising sources	Weak provenance — cannot reliably cite where claims came from
Following instructions	Sensitivity to phrasing; potential to reflect training-data bias

The most important limitation for scholarship is hallucination: because an LLM generates statistically likely text rather than retrieving verified facts, it can produce fabricated references, false figures and incorrect claims stated with full confidence. It also lacks reliable provenance — it cannot, by default, tell you which source a statement came from. Outputs must therefore be independently verified, not trusted at face value.

Responsible use and disclosure in research

Used responsibly, LLMs can accelerate literature triage, drafting and coding. Used uncritically, they introduce errors, fabricated citations and undisclosed authorship concerns. Many journals and funders now require disclosure of generative-AI use in manuscripts, and most editorial policies hold that an LLM cannot be an author because it cannot take responsibility for the work. Good practice is to verify every factual claim and reference, keep a record of how the tool was used, and report that use transparently. Outputs produced or assisted by LLMs should be treated as research outputs subject to the same scrutiny and documentation as any other, described with consistent terminology. Our guidance for authors covers disclosure and documentation expectations, and reliable handling of model outputs intersects with sound data infrastructure and metadata practice.

Frequently asked questions

What is a token in a large language model?

A token is the unit of text an LLM processes — a whole word, part of a word, or punctuation. Text is split into tokens and converted to numerical vectors; the model predicts the next token in sequence. A model’s context window limits how many tokens it can consider at once.

What is the difference between pretraining and fine-tuning?

Pretraining teaches a model general language patterns from a vast, broad corpus and is computationally expensive. Fine-tuning then adapts that pretrained model to specific tasks or behaviours using smaller, targeted datasets, so one base model can be specialised for many uses.

Why do large language models hallucinate?

Because they generate statistically likely text rather than retrieving verified facts. An LLM predicts plausible continuations, so it can state fabricated references or false figures with full confidence. Outputs must be independently verified, since the model has no built-in mechanism guaranteeing factual accuracy.

Should I disclose using an LLM in my research?

Yes. Many journals and funders require disclosure of generative-AI use, and most hold that an LLM cannot be a named author. Verify all claims and references, record how the tool was used, and report that use transparently in line with relevant editorial policy.

June 18, 2026

FAIR Principles for Research Data Explained

FAIR data refers to research data managed according to four guiding principles — Findable, Accessible, Interoperable and Reusable — designed to maximise the value of data for both humans and machines. The principles were set out by Mark Wilkinson and colleagues in a landmark 2016 paper in Scientific Data and have since been adopted widely by funders, publishers and research institutions as a benchmark for good data stewardship. FAIR describes how data should be described, shared and preserved so that it can be discovered and reused long after a project ends.

A common misconception is that FAIR means “open”. It does not. FAIR is about good management and clear conditions of use; data can be FAIR while access remains controlled, which matters for sensitive or personal data.

What each principle means

The four principles work together, and the order spells the acronym rather than a strict sequence. Each rests heavily on metadata and persistent identifiers.

Principle	Core idea	Key enablers
Findable	Data and metadata are easy to locate by humans and machines	Persistent identifiers (e.g. DOIs), rich metadata, indexing
Accessible	Once found, data can be retrieved by a clear, open protocol	Standard protocols; metadata stays available even if data are restricted
Interoperable	Data can be combined and used with other data and systems	Shared vocabularies, standard formats, controlled terminologies
Reusable	Data are richly described and licensed for reuse	Clear licences, provenance, community standards and metadata

Findable requires that data and metadata carry globally unique, persistent identifiers and are described well enough to be indexed and searched. Accessible means the data can be retrieved using a standardised, open communication protocol, with authentication where needed — and, importantly, that metadata remain accessible even when the underlying data are not. Interoperable calls for data to use shared, standard formats and vocabularies so they can be integrated with other datasets and processed by different systems. Reusable requires rich description, clear provenance and an explicit usage licence so others can confidently build on the data.

The role of persistent identifiers and metadata

Two enablers run through all four principles: persistent identifiers and metadata. A persistent identifier — such as a DOI for a dataset or an ORCID for a researcher — provides a stable, resolvable reference that does not break when URLs change, underpinning findability and provenance. Metadata — structured information describing what the data are, how they were produced, and under what terms they may be used — is what makes data discoverable, interpretable and reusable. Crucially, FAIR treats metadata as valuable in its own right: rich, standardised metadata can remain open and findable even when the dataset itself is access-controlled. This is precisely the kind of standardised description that shared vocabularies, such as the CASRAI dictionary, and broader data infrastructure are built to support.
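
As a rough illustration of metadata doing this work, the sketch below describes a dataset as a Python dictionary and checks it for a few fields loosely associated with each principle. The field names, the mapping of fields to principles and the DOI are all hypothetical; they do not follow any particular metadata schema, and passing this check would not by itself make a dataset FAIR.

```python
# Hypothetical dataset description; field names are illustrative only.
record = {
    "identifier": "https://doi.org/10.5555/example-dataset",  # placeholder DOI
    "title": "Example survey responses",
    "license": "CC-BY-4.0",
    "access": "restricted",   # data may be controlled while metadata stay open
    "format": "text/csv",
    "provenance": "Collected via online questionnaire",
}

# A deliberately small, assumed mapping from principle to expected fields.
REQUIRED_FIELDS = {
    "Findable": ["identifier", "title"],
    "Accessible": ["access"],
    "Interoperable": ["format"],
    "Reusable": ["license", "provenance"],
}

def missing_fields(metadata):
    """List, per principle, the expected fields that are absent or empty."""
    gaps = {}
    for principle, fields in REQUIRED_FIELDS.items():
        absent = [name for name in fields if not metadata.get(name)]
        if absent:
            gaps[principle] = absent
    return gaps

print(missing_fields(record))                      # {}
print(missing_fields({"title": "Untitled file"}))  # gaps under every principle
```

Note that the complete record is access-restricted yet passes the check, mirroring the point that well-described data need not be open.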

FAIR versus open

FAIR and open are related but distinct. Open data is data anyone can freely access, use and redistribute. FAIR data is well-managed, well-described data with clear access conditions, which may or may not be open. A maxim widely used alongside the principles, “as open as possible, as closed as necessary”, captures the balance: maximise reuse while respecting legitimate constraints such as privacy, consent, commercial sensitivity or indigenous data rights. A dataset of patient records can be made FAIR (richly described, identified, governed and licensed) without being openly downloadable. Conversely, dumping a file online makes it open but not necessarily FAIR if it lacks identifiers, metadata or a licence.

For researchers, adopting FAIR practice means assigning identifiers, writing good metadata, using standard formats and stating licences from the outset rather than at the end of a project. Guidance on preparing and describing data is available in our resources for authors, and FAIR data underpins the reproducibility goals discussed across our research-outputs coverage.

Frequently asked questions

What does FAIR stand for?

FAIR stands for Findable, Accessible, Interoperable and Reusable. The four principles, published by Wilkinson and colleagues in 2016, describe how research data and metadata should be managed so they can be discovered, retrieved, combined and reused effectively by both humans and machines.

Does FAIR mean the same as open data?

No. Open data can be freely accessed and reused by anyone, whereas FAIR data is well-described and well-managed with clear access conditions that may be restricted. The guiding phrase is “as open as possible, as closed as necessary”, so sensitive data can still be FAIR.

Why are persistent identifiers important for FAIR data?

Persistent identifiers such as DOIs and ORCIDs provide stable, resolvable references that do not break when web addresses change. They underpin findability and provenance, letting data, researchers and outputs be reliably located and credited over the long term.

Can data be FAIR without being publicly downloadable?

Yes. FAIR requires clear access protocols and rich metadata, not unrestricted access. Metadata can remain findable and accessible even when the underlying dataset is controlled, so sensitive datasets can be made FAIR while access stays appropriately governed.

June 18, 2026

IEEE and AMA Citation Styles Explained

IEEE citation uses bracketed numbers in the text that point to a numbered reference list, and is standard across engineering and computer science. AMA citation, used widely in medicine, uses superscript numbers instead. Both are numeric systems, but they differ in formatting, ordering and discipline.

This guide explains how each style handles in-text markers and reference entries, with worked examples and a side-by-side table.

IEEE: numbers in square brackets

In IEEE style, each source is assigned a number the first time it is cited, in square brackets, and that number is reused for every later citation of the same source. References are listed in the order they first appear — not alphabetically.

In-text: Recent work on neural search has improved recall [1], and later studies confirmed it [2], [3].
Reused number: The original architecture [1] remains the baseline.
As a noun: As shown in [4], latency dropped sharply.
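
The first-appearance rule is easy to express in code. The sketch below is a minimal illustration, assuming citations are available as a sequence of source keys in the order they occur in the text; the keys themselves are hypothetical labels.

```python
def number_citations(cited_keys):
    """Assign IEEE-style numbers in order of first appearance.

    cited_keys is the sequence of source keys as they occur in the text.
    """
    numbers = {}
    for key in cited_keys:
        if key not in numbers:
            numbers[key] = len(numbers) + 1  # a new source gets the next number
    return numbers

in_text_order = ["smith2021", "lee2019", "chen2020", "smith2021", "garcia2022"]
numbers = number_citations(in_text_order)

markers = [f"[{numbers[key]}]" for key in in_text_order]
print(markers)         # ['[1]', '[2]', '[3]', '[1]', '[4]']

reference_list = sorted(numbers, key=numbers.get)
print(reference_list)  # ['smith2021', 'lee2019', 'chen2020', 'garcia2022']
```

The second citation of `smith2021` reuses `[1]`, and the reference list comes out in citation order rather than alphabetical order.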

A reference-list entry abbreviates authors’ given names to initials and places the reference number in square brackets:

[1] J. Smith and A. Jones, “A scalable indexing method,” IEEE Trans. Knowl. Data Eng., vol. 33, no. 4, pp. 110–128, 2021.

AMA: superscript numbers

AMA style places superscript numerals after the relevant text, again numbered in order of first appearance. The reference list follows the same numeric order. AMA dominates clinical and biomedical journals.

In-text: Adherence improved across the cohort.¹
Multiple sources: Several trials reported the same effect.²,³
Range: The pattern held across studies.⁴⁻⁶

A reference entry uses journal abbreviations and a specific punctuation pattern:

1. Smith J, Jones A. Outcomes in the treatment cohort. J Clin Res. 2021;12(3):110-128.

IEEE versus AMA at a glance

Feature	IEEE	AMA
Discipline	Engineering, computer science	Medicine, biomedicine
In-text marker	Square brackets [1]	Superscript ¹
List order	Order of appearance	Order of appearance
Author names	Initials before surname: J. Smith	Surname then initials: Smith J
Title style	Article title in quotes	Article title, no quotes
Journal name	Abbreviated, italic	Abbreviated, italic

Why discipline drives style choice

Numeric styles keep the running text uncluttered, which suits dense technical and clinical writing where a single sentence may lean on several sources. IEEE’s bracketed numbers double as compact cross-references to equations, figures and prior work; AMA’s superscripts keep medical prose readable at speed. Compare this with author-date approaches in our guide to Harvard referencing, where the author’s name carries into the sentence.

For a wider map of the field, see citation styles compared, and for general technique, our practitioner guide to citing sources.

Common pitfalls

The most frequent IEEE error is alphabetising the reference list; it must follow first-appearance order. The most frequent AMA error is mixing in author-date phrasing (“Smith showed¹”) inconsistently; keep the superscript doing the work. In both styles, every number in the list must be cited at least once in the text, and every number cited in the text must have an entry in the list. Our guidance for authors covers reference hygiene before submission.
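
The cross-check between text and list can be automated before submission. The following is a minimal sketch for IEEE-style bracketed markers; the function name and the sample sentence are invented, and it recognises only single markers such as `[3]`, so ranges or AMA superscripts would need extra handling.

```python
import re

def check_numbering(text, reference_count):
    """Compare bracketed citation numbers in text with a list of known length."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", text)}
    listed = set(range(1, reference_count + 1))
    return {
        "listed_but_never_cited": sorted(listed - cited),
        "cited_but_not_listed": sorted(cited - listed),
    }

draft = "Recall improved [1], as later work confirmed [2], [3]. As shown in [5], latency dropped."
print(check_numbering(draft, reference_count=4))
# {'listed_but_never_cited': [4], 'cited_but_not_listed': [5]}
```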

How citation style fits research outputs metadata

Citation style governs the visible reference; controlled vocabulary in our dictionary and contributor attribution through CRediT govern the structured metadata around it. Together they make a paper’s outputs machine-readable. Explore more in research outputs.

Frequently asked questions

Are IEEE and Vancouver the same?

They are close cousins — both numeric, both ordered by appearance — but differ in formatting detail, and Vancouver is associated with biomedicine while IEEE is associated with engineering. AMA is itself a Vancouver-derived medical style.

Do IEEE numbers go inside or outside punctuation?

IEEE brackets typically sit before the full stop, treated as part of the sentence: “…confirmed the result [2].”

Can I cite the same AMA source twice?

Yes — reuse its original number every time it appears, just as in IEEE.

Which style should a computer science thesis use?

IEEE is the conventional default for computer science and electrical engineering, but always follow your department’s or publisher’s stated requirement.

June 18, 2026

A Research Administrator’s Guide to CRIS (Current Research Information Systems)

Introduction to CRIS in Scholarly Spaces

Current Research Information Systems (CRIS) are the software backbones of university research administration. They collect, integrate, and showcase institutional research activities, linking researchers, publications, funding, and equipment in a single relational database.

The Structural Anatomy of a CRIS

A CRIS integrates data from payroll, student records, finance, and external scholarly databases (such as Crossref and Scopus). It provides a central source of truth for university research, mapping the relationships between researchers (ORCID iDs), publications (DOIs), organizations (ROR IDs), grants, and patent filings.

CRIS vs. Institutional Repository: Collaborative Integrations

While they sound similar, CRIS and Institutional Repositories (IRs) serve distinct purposes. A CRIS is administrative and evaluation-focused, managing university reporting and profile pages. An IR is publication-focused, dedicated to open-access manuscript preservation. Modern systems integrate the two, allowing a CRIS to trigger manuscript self-archiving in the IR automatically.

Selecting and Deploying a Standardized CRIS Platform

When deploying a CRIS, universities choose between commercial options (e.g., Pure by Elsevier, Symplectic Elements) and open-source platforms (e.g., VIVO, DSpace-CRIS). Key requirements include:
1. Support for the CERIF data model.
2. Automated API integration with scholarly indexers.
3. Robust privacy controls that protect personal salary and patent data.

Key Data and Comparative Metrics

CRIS Platform	Licensing Model	Primary Data Schema	Strength Areas
Pure (Elsevier)	Commercial (Proprietary)	Elsevier Custom / CERIF compatible	Deep integration with Scopus, rich profiling dashboards.
Symplectic Elements	Commercial (Proprietary)	Proprietary Schema / CERIF compatible	Highly customizable workflows, strong repository integrations.
DSpace-CRIS	Open-Source (Free)	CERIF compatible / Dublin Core extension	Direct integration of repository and CRIS, active developer community.

Actionable Checklist for CRIS

- Formulate an institutional working group to define university CRIS requirements.
- Ensure the selected CRIS platform fully supports the CERIF standard.
- Integrate internal ERP, HR, and payroll databases with the CRIS.
- Configure automatic Crossref and ORCID API data harvesting feeds.
- Establish user-friendly profile pages for faculty to showcase active projects.

June 17, 2026

ORCID for researchers: connecting your identifier to your contributions

Most researchers now have an ORCID iD, often created in a hurry because a journal or funder asked for one. Far fewer have a record that actually does the work an identifier is meant to do. An ORCID iD that sits empty, or that you copy facts into by hand, delivers almost none of its value. The point of the identifier is connection to your publications, your grants, your affiliations, and the wider identifier ecosystem, and that is what this guide is about. The foundational explainer lives at persistent identifiers for authors, and this article is the practical companion.

What an ORCID iD actually solves

An ORCID iD is a persistent, unique identifier for an individual researcher. It is a sixteen-character code (fifteen digits followed by a check character, which is a digit or X), expressed as an HTTPS URI, and it stays with you across name changes, institution moves, and career stages. The problem it solves is name disambiguation: in a literature full of common surnames, initial variations, and transliterations, a string name cannot reliably tell two researchers apart, and cannot reliably tie one researcher’s scattered outputs together. The iD does both. It distinguishes you from every other researcher who shares your name, and it gathers your contributions under one unambiguous, machine-readable identity.
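
Part of what makes the iD machine-readable is that its final character is a checksum, which ORCID documents as following the ISO 7064 MOD 11-2 scheme. The sketch below checks the shape and checksum of an iD; the function names are our own, and a passing check says only that the string is well formed, not that the iD is registered to anyone.

```python
def orcid_check_character(base_digits):
    """Compute the final character of an ORCID iD from its first 15 digits
    using the ISO 7064 MOD 11-2 scheme."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_well_formed_orcid(orcid):
    """Check the length and checksum of an iD, given bare or as a URI."""
    compact = orcid.rsplit("/", 1)[-1].replace("-", "")
    base = compact[:15]
    if len(compact) != 16 or not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_character(base) == compact[15].upper()

print(is_well_formed_orcid("https://orcid.org/0000-0002-1825-0097"))  # True
print(is_well_formed_orcid("0000-0002-1825-0098"))                    # False: checksum mismatch
```

A check like this catches most typing errors at the point of entry, before a mistyped iD attaches a work to the wrong record.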

This is why funders and publishers increasingly require it. An ORCID iD on a submission or grant application means the work, the award, and the person can be linked without guesswork — the difference between a name a human must interpret and an identifier a system can resolve.

Step 1: register and complete the core of your record

Registration is free and takes minutes at orcid.org. The valuable part is what comes next: populating the record so it represents you. Add your employment and education affiliations, ideally selected from ORCID’s organisation lookup so they carry an organisation identifier rather than a free-typed string. Where the lookup is backed by ROR — the Research Organization Registry — your affiliation is anchored to a persistent organisation identifier, which is what lets systems reliably connect you to your institution. (For the organisation side of the ecosystem, see what is ROR.) Add alternative name forms and a short biography so that the record disambiguates you even where systems still rely on names.

Step 2: let trusted organisations write to your record

This is the step that turns a static profile into a living one, and it is the step most researchers skip. ORCID has a permissions model: you can grant a trusted organisation (a publisher, a funder, a repository, or your institution’s research-information system) permission to read from and write to your record. Once granted, these systems can add works, grants, and affiliations for you, automatically and with provenance attached.

- Authorise Crossref and DataCite auto-update so that when you publish an article or deposit a dataset with your iD, the output appears on your record without manual entry.
- Grant your funders permission so that awards are written to your record from the authoritative source.
- Connect your institution’s system so affiliations and outputs stay synchronised.

The principle is enter-once, reuse-everywhere. A contribution asserted with your iD by a trusted source is more credible than one you typed yourself, because the assertion carries the provenance of the organisation that made it. The record stops being a CV you maintain and becomes a verified, auto-updating account of your work.

The single highest-value action most researchers can take with ORCID is to turn on auto-update permissions for Crossref and DataCite. After that, publishing with your iD maintains your record for you.

Step 3: use your iD everywhere it is asked for — and where it is not

An identifier only disambiguates if it is attached at the moment of contribution. Enter your ORCID iD on every manuscript submission, every grant application, every dataset deposit, and every peer-review record. Each time you do, you create a verified link between the work and your identity that flows into the connected systems. Conversely, an output published without your iD is one your record cannot automatically claim, and one that name-based systems may attach to the wrong person.

Step 4: connect ORCID to the rest of the identifier graph

ORCID is one node in a connected ecosystem, and its value compounds when it is linked to the others. Your iD identifies you; ROR identifies your organisations; a DOI identifies your outputs; a grant identifier identifies your funding; and a project identifier such as RAID identifies the activity that ties them together. When your outputs carry your ORCID iD and your institution’s ROR ID, and your awards carry grant identifiers linked to your iD, the graph assembles itself: a query can move from you to your works to your funders to your institution without a single hand-typed reconciliation.
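
The idea of a graph that can be walked without hand-typed reconciliation can be sketched with a plain dictionary of links. Every identifier below is a placeholder invented for the example, and the structure is an illustration rather than how any registry actually stores or exposes its data.

```python
from collections import deque

# Hypothetical links between identifiers; every value is a placeholder,
# not a real record.
links = {
    "orcid:researcher": ["doi:article", "ror:university"],
    "doi:article": ["grant:award", "ror:university"],
    "grant:award": ["funder:agency"],
}

def reachable(start, graph):
    """Breadth-first walk: every identifier that can be reached from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}

print(sorted(reachable("orcid:researcher", links)))
# ['doi:article', 'funder:agency', 'grant:award', 'ror:university']
```

The funder is reachable from the researcher only because the article carries both the researcher's iD and a grant identifier; remove either link and the path disappears.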

This graph is also where contribution metadata lives. When a publisher records a CRediT statement and writes the relevant roles to your ORCID record alongside the publication, your iD begins to carry not just what you have published but what you did on each output — the richer, contribution-aware picture that responsible assessment depends on.

A note on what ORCID will and will not do

ORCID disambiguates and connects; it does not, by itself, validate the quality of a contribution or decide authorship. An auto-updated record is only as good as the assertions trusted sources write to it, and you remain responsible for reviewing your record and correcting errors. Keep the public-visibility settings deliberate, review incoming auto-updates periodically, and treat the record as something you curate, not something that runs entirely without you.

Where shared vocabulary fits

The identifier ecosystem works only when systems agree on what each identifier means and how they connect — what a “trusted organisation” permission grants, how an affiliation is asserted, how an output links to a person. A shared, federated vocabulary that defines these relationships and points back to ORCID and ROR for the authoritative infrastructure is what lets the graph hold together across systems. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the persistent-identifiers domain.

June 17, 2026
Variance in Statistics: Definition and Formula

Variance is a measure of how spread out a set of values is, defined as the average of the squared deviations of each value from the mean. A large variance means the data points are widely dispersed; a small variance means they cluster tightly around the mean. Because the deviations are squared, variance is always non-negative and is expressed in squared units of the original measurement.

The definition of variance

To calculate variance, you first find the mean of the data, then subtract the mean from each value to get the deviations. Squaring each deviation removes the sign (so positive and negative deviations do not cancel) and gives greater weight to values far from the mean. The average of these squared deviations is the variance.

Variance is the foundation of many statistical methods, including the analysis of variance (ANOVA), regression diagnostics and the construction of confidence intervals. Reporting it transparently supports the goals set out in our reproducibility coverage.

Population variance versus sample variance

The formula depends on whether your data are the entire population or a sample drawn from it. For a population, you divide the sum of squared deviations by the number of values, N. For a sample of n values, you divide by n − 1 instead of n. This adjustment, known as Bessel’s correction, produces an unbiased estimate of the population variance: the deviations are measured from the sample mean, which sits closer to the sample’s own values than the true population mean does, so dividing by n would systematically underestimate the spread.

Quantity	Symbol	Divisor
Population variance	σ²	N
Sample variance	s²	n − 1
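Written out, the two definitions might look as follows. The symbols x_i (an individual value), μ (the population mean) and x̄ (the sample mean) are notation assumed here for illustration; the article itself names only σ², s², N and n.

```latex
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
\qquad\qquad
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2
```

The only differences are the divisor and the mean used as the reference point.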

A worked conceptual example

Suppose five replicate measurements give 4, 8, 6, 5 and 2. The mean is (4 + 8 + 6 + 5 + 2) / 5 = 5. The deviations from the mean are −1, 3, 1, 0 and −3. Squaring these gives 1, 9, 1, 0 and 9, which sum to 20. Treating the five values as a population, the variance is 20 / 5 = 4. Treating them as a sample, the variance is 20 / 4 = 5. The sample figure is slightly larger, reflecting Bessel’s correction.
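The same arithmetic can be checked in a few lines of Python. This is a minimal sketch using the standard-library `statistics` module; the variable names are illustrative.

```python
from statistics import mean, pvariance, variance

values = [4, 8, 6, 5, 2]

centre = mean(values)                                # 5
squared_devs = [(x - centre) ** 2 for x in values]   # [1, 9, 1, 0, 9]
sum_sq = sum(squared_devs)                           # 20

population_var = sum_sq / len(values)        # divide by N
sample_var = sum_sq / (len(values) - 1)      # divide by n - 1

# The standard library agrees with the hand calculation.
assert population_var == pvariance(values)
assert sample_var == variance(values)

print(population_var, sample_var)
```

Here `pvariance` divides by N and `variance` divides by n − 1, so choosing the function is choosing the divisor.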

Variance and the standard deviation

Variance and the standard deviation describe the same property of spread, but in different units. The standard deviation is simply the square root of the variance, which returns the measure to the original units of the data. In our worked example the population standard deviation is √4 = 2. Because the standard deviation is easier to interpret alongside the mean, it is often reported in papers; see our companion piece on the standard deviation for detail. Variance, however, has convenient mathematical properties, which is why it underlies so many statistical procedures.

Interpreting variance correctly

Because variance is in squared units, its absolute size is hard to interpret in isolation. A variance of 4 cm² is meaningful only relative to the scale of the measurement. Variance is also sensitive to outliers: squaring magnifies the effect of extreme values, so a single anomalous point can inflate the variance substantially. Always inspect your data distribution before reporting variance, and define the term consistently in your methods. The CASRAI dictionary and our author guidance encourage precise, reproducible statistical reporting.
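A quick sketch of that sensitivity, reusing the five values from the worked example and appending one made-up anomalous reading (the value 30 is an assumption chosen purely for illustration):

```python
from statistics import variance

baseline = [4, 8, 6, 5, 2]
with_outlier = baseline + [30]   # 30 is a hypothetical anomalous reading

var_baseline = variance(baseline)        # sample variance, divisor n - 1
var_outlier = variance(with_outlier)

print(var_baseline, var_outlier, var_outlier / var_baseline)
```

With these numbers a single extra point raises the sample variance more than twentyfold, which is why inspecting the distribution first matters.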

Frequently asked questions

Why is variance squared rather than absolute?

Squaring the deviations keeps the measure mathematically tractable and differentiable, which makes it the natural basis for least squares estimation and many other techniques. The absolute deviation is an alternative but lacks these convenient properties.

When should I divide by n − 1 instead of n?

Divide by n − 1 whenever your data are a sample used to estimate the variance of a wider population. Divide by N only when your data genuinely represent the entire population of interest.

Is a high variance bad?

Not inherently. High variance simply means greater spread. Whether that is good or bad depends on context: high variance in measurement error is undesirable, but natural biological variation may be expected and informative.

June 17, 2026

ANOVA (Analysis of Variance) Explained: Comparing Means Across Groups

Analysis of variance (ANOVA) is a statistical method that tests whether the means of three or more groups differ by more than would be expected from random variation alone. It does this by comparing the variance between group means against the variance within groups, summarised in a single F-statistic. ANOVA is one of the most widely used inferential tests in experimental research, and reporting it transparently is central to reproducible analysis.

Why ANOVA instead of multiple t-tests?

A t-test compares two group means. When you have three or more groups, it is tempting to run a separate t-test for every pair. The problem is the family-wise error rate: each test carries its own chance of a false positive, and those chances accumulate. With three groups there are three pairwise comparisons; at a 5% significance level the probability of at least one false positive rises to roughly 14% (1 − 0.95³ ≈ 0.143, treating the tests as independent), and it climbs further as groups are added. ANOVA solves this by performing a single omnibus test that asks one question: are any of the group means different?
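The accumulation can be sketched numerically. The function below is illustrative and treats the pairwise tests as independent, which is a simplifying assumption: real pairwise comparisons share data, so the figures are approximations.

```python
from math import comb


def family_wise_error_rate(n_groups, alpha=0.05):
    """Approximate chance of at least one false positive across all
    pairwise tests, assuming the tests are independent."""
    n_comparisons = comb(n_groups, 2)   # number of distinct pairs
    return 1 - (1 - alpha) ** n_comparisons


for k in (3, 4, 5):
    print(k, comb(k, 2), round(family_wise_error_rate(k), 3))
```

Under this approximation three groups give about 0.143, four groups about 0.265 and five groups about 0.401.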

This control of error is why ANOVA underpins so much of experimental design. For a refresher on what significance thresholds mean in practice, see our explainer on p-values and statistical significance.

The F-statistic and how it works

ANOVA partitions the total variability in the data into two components. The between-groups variance reflects how far each group mean sits from the overall (grand) mean. The within-groups variance reflects the natural spread of observations inside each group. The F-statistic is the ratio of these two:

F = between-groups variance / within-groups variance

If the groups truly share a common mean, both quantities estimate the same underlying variability and F sits near 1. When real differences exist, the between-groups term grows and F rises. A large F, evaluated against the F-distribution with the appropriate degrees of freedom, yields a small p-value and signals that at least one mean differs.
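To make the ratio concrete, here is a minimal pure-Python sketch of a one-way F-statistic. The function name and the three small groups are made up for illustration; in practice a statistics package would also supply the p-value from the F-distribution, which this sketch does not compute.

```python
from statistics import mean


def one_way_f(groups):
    """F = (between-groups mean square) / (within-groups mean square)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)

    # Between groups: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # hypothetical data
f_stat = one_way_f(groups)
print(round(f_stat, 3))
```

For these data the between-groups mean square is 13 and the within-groups mean square is 1, so F is 13 on 2 and 6 degrees of freedom.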

One-way versus two-way ANOVA

The design depends on how many factors you are manipulating.

Feature	One-way ANOVA	Two-way ANOVA
Number of factors	One independent variable	Two independent variables
Example question	Does diet type affect plant growth?	Do diet type and watering frequency affect plant growth?
Main effects	One	Two (one per factor)
Interaction	Not assessed	Tests whether factors combine non-additively
Output	Single F-statistic	F-statistic for each main effect plus interaction

The key advantage of two-way ANOVA is the interaction effect: it reveals whether the influence of one factor depends on the level of another, something separate analyses would miss.

Assumptions you must check

ANOVA rests on three core assumptions. Observations should be independent. The residuals should be approximately normally distributed. And the groups should show roughly equal variances, a property called homogeneity of variance (homoscedasticity). When variances differ markedly, a Welch ANOVA is a robust alternative; when normality fails, a non-parametric Kruskal-Wallis test may be more appropriate. Stating which assumptions were tested, and how, is good practice and supports replication, as we discuss across our reproducibility coverage.

Post-hoc tests: locating the difference

A significant ANOVA tells you that some mean differs, but not which one. Post-hoc tests answer that follow-up while still controlling the family-wise error rate. Tukey’s HSD is the standard choice for all pairwise comparisons with equal sample sizes; Bonferroni correction is conservative and simple; Scheffé’s test is flexible for complex contrasts. Crucially, you should not revert to uncorrected t-tests after a significant ANOVA, as that reintroduces the inflated error the test was designed to prevent.
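Of the three, the Bonferroni correction is simple enough to sketch by hand: multiply each raw p-value by the number of comparisons (capping at 1) and compare the result with the original threshold. The p-values below are invented for illustration, and the function name is hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-value, significant?) for each raw p-value."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p_adj, p_adj <= alpha) for p_adj in adjusted]


raw_p_values = [0.004, 0.030, 0.200]   # made-up pairwise results
results = bonferroni(raw_p_values)
for raw, (adj, significant) in zip(raw_p_values, results):
    print(raw, round(adj, 3), significant)
```

Note that the second comparison would pass an uncorrected 5% threshold but fails after correction, which is the conservatism the text describes.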

Equally important, statistical significance does not measure how large a difference is. Always pair ANOVA results with an effect size such as eta-squared, as covered in our companion piece on why effect size matters beyond significance. Authors planning a study should also budget adequate sample size and statistical power so a real effect can actually be detected.
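Eta-squared is the share of total variability attributable to group membership: the between-groups sum of squares divided by the total sum of squares. A minimal sketch, with a hypothetical function name and made-up data:

```python
from statistics import mean


def eta_squared(groups):
    """Proportion of total variability explained by group membership."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_between / ss_total


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # hypothetical data
effect = eta_squared(groups)
print(round(effect, 4))
```

For these data the result is 0.8125, meaning about 81% of the variability lies between the groups rather than within them.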

Frequently asked questions

What does a significant ANOVA result actually tell you?

It tells you that at least one group mean differs from the others by more than chance would explain. It does not identify which groups differ or how large the difference is; you need post-hoc tests and effect sizes to answer those questions.

Can ANOVA be used for only two groups?

Yes. With two groups a one-way ANOVA gives results mathematically equivalent to an independent-samples t-test (F equals t squared). ANOVA’s real value appears with three or more groups, where it prevents the error inflation of multiple t-tests.
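The equivalence can be checked numerically. The sketch below computes a pooled-variance (equal-variance) independent-samples t and a one-way F on the same two made-up groups; both helper functions are illustrative rather than taken from any library.

```python
from math import sqrt
from statistics import mean


def pooled_t(a, b):
    """Independent-samples t-statistic using the pooled variance."""
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


def one_way_f(groups):
    """One-way ANOVA F-statistic."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (len(all_values) - len(groups))
    return ms_between / ms_within


a, b = [1, 2, 3], [2, 3, 4]   # hypothetical data
t_stat = pooled_t(a, b)
f_stat = one_way_f([a, b])
print(round(t_stat ** 2, 6), round(f_stat, 6))
```

Both printed values are 1.5. The identity holds for the pooled-variance t-test; a Welch t-test, which does not pool variances, will generally not match.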

What is the difference between a main effect and an interaction?

A main effect is the overall influence of one factor averaged across the others. An interaction means the effect of one factor changes depending on the level of another. Detecting interactions is the principal reason to use two-way rather than one-way designs.

How should ANOVA results be reported for reproducibility?

Report the F-statistic with both degrees of freedom, the p-value, an effect size, the post-hoc method used, and confirmation that assumptions were checked. The CASRAI dictionary and our guidance for authors set out the metadata that makes such results auditable.

June 17, 2026

Harvard Referencing Style: A Complete Guide

Harvard referencing is an author-date citation system in which sources are cited in the text by the author’s surname and the year of publication, with full details listed alphabetically in a reference list. There is no single governing body for Harvard, so exact punctuation varies between universities; consistency within one document matters more than any universal rule.

This guide covers the in-text format, the reference-list format, the major institutional variants and worked examples you can adapt.

Why there is no single “Harvard” authority

Unlike APA, which is maintained by the American Psychological Association, Harvard is a family of author-date styles rather than a centrally published standard. Many institutions publish their own Harvard guide, and the British Standard BS ISO 690 informs several of them. The practical consequence: punctuation, italicisation and the use of “and” versus “&” differ between guides. Always follow the specific guide your department or journal mandates, and apply it uniformly.

For a wider comparison of conventions across the citation landscape, see our overview of citation styles compared.

In-text citations: (author, year)

Harvard in-text citations name the author and year in parentheses, adding a page number for direct quotations.

Paraphrase: Research participation rose sharply over the decade (Smith, 2021).
Direct quote: The findings were “unambiguous and replicable” (Smith, 2021, p. 14).
Author named in sentence: Smith (2021) argued that the trend was structural.
Two authors: (Smith and Jones, 2020) — some variants use an ampersand: (Smith & Jones, 2020).
Three or more: (Smith et al., 2019).

Where an author has several works in the same year, distinguish them with letters: (Smith, 2021a), (Smith, 2021b).
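The in-text patterns above are regular enough to sketch as a small formatter. This is an illustrative helper, not part of any reference manager; the function name and the `ampersand` switch are assumptions made for the example.

```python
def in_text_citation(surnames, year, page=None, ampersand=False):
    """Build a parenthetical Harvard citation such as (Smith, 2021, p. 14)."""
    if len(surnames) == 1:
        names = surnames[0]
    elif len(surnames) == 2:
        connector = "&" if ampersand else "and"
        names = f"{surnames[0]} {connector} {surnames[1]}"
    else:
        names = f"{surnames[0]} et al."   # three or more authors

    citation = f"{names}, {year}"
    if page is not None:
        citation += f", p. {page}"        # page locator for direct quotations
    return f"({citation})"


print(in_text_citation(["Smith"], 2021))
print(in_text_citation(["Smith"], 2021, page=14))
print(in_text_citation(["Smith", "Jones"], 2020, ampersand=True))
print(in_text_citation(["Smith", "Jones", "Lee"], 2019))
print(in_text_citation(["Smith"], "2021a"))
```

Passing the year as a string lets the same helper handle the lettered form for several works in one year.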

The reference list

The reference list appears at the end, ordered alphabetically by author surname. A typical journal-article entry reads:

Smith, J. (2021) ‘Patterns of research participation’, Journal of Research Methods, 12(3), pp. 110–128.

A book entry:

Jones, A. (2020) Designing the Research Question. 2nd edn. London: Academic Press.

Core elements are author, year, title, source and locator. Italicise the title of the standalone work (the journal or the book), and place the article title in quotation marks where the variant requires it.
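As a rough sketch, the journal-article pattern shown above can be assembled from its core elements. The function and parameter names are hypothetical, the output follows only the one variant illustrated in this guide, and plain text cannot carry the italics that the journal title would take.

```python
def journal_reference(author, year, article_title, journal, volume, issue, pages):
    """Assemble a journal-article entry in the variant shown in this guide."""
    return (
        f"{author} ({year}) ‘{article_title}’, {journal}, "
        f"{volume}({issue}), pp. {pages}."
    )


entry = journal_reference(
    author="Smith, J.",
    year=2021,
    article_title="Patterns of research participation",
    journal="Journal of Research Methods",
    volume=12,
    issue=3,
    pages="110–128",
)
print(entry)
```

Keeping the elements as separate fields means a different institutional variant only needs a different template, not re-keyed data.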

Institutional variants you will meet

Because Harvard is decentralised, you will encounter small but stubborn differences. The table below illustrates a few common points of divergence; treat them as examples, not as ranked authorities.

Feature	Variant style A	Variant style B
Two-author connector	Smith and Jones	Smith & Jones
Year placement	Smith, J. (2021)	Smith, J., 2021.
Article title	‘In quotes’	No quotes, sentence case
Page abbreviation	pp. 110–128	110–128

The recurring lesson is the same one we stress in our for authors guidance: choose one variant, document it, and apply it without exception.

Citing electronic sources in Harvard

Web pages take the same author-date logic with an access date added, because online content can change. For a full walkthrough across styles, see our companion piece on how to cite a website correctly. For numeric and superscript alternatives used in engineering and medicine, see IEEE and AMA citation styles explained.

How Harvard fits the wider research output

Referencing is one strand of describing scholarship clearly. Controlled terminology in our dictionary and contributor roles via CRediT complement consistent citation by making the rest of a paper’s metadata as unambiguous as its reference list. Browse more in research outputs.

Frequently asked questions

Is Harvard the same as APA?

No. Both are author-date systems, but APA is a single published standard with fixed rules, whereas Harvard is a family of similar styles with institution-specific punctuation. Our APA essentials guide details the APA-specific conventions.

Which Harvard variant should I use?

The one your department, publisher or journal specifies. If none is mandated, pick a reputable institutional guide and follow it consistently throughout the document.

Do I need page numbers for paraphrases?

Page numbers are required for direct quotations and recommended when pointing to a specific passage. For general paraphrase of a whole source, author and year usually suffice.

How do I cite a source with no author?

Use the title in place of the author, or the organisation responsible. Practical strategies for missing metadata appear in our practitioner guide to citing sources.

June 17, 2026

Ethics review and the IRB/REC process: what researchers should expect

For research that involves people — their bodies, their behaviour, their data, their tissue — ethics review is not a bureaucratic hoop to clear before the real work begins. It is a substantive safeguard, the mechanism by which a community of researchers commits, in advance, that the people they study will be respected, protected and treated fairly. Researchers who approach it as a formality tend to find it frustrating; those who understand what it is trying to achieve usually find it navigable. This article explains what an ethics committee does, the review tiers a researcher will encounter, and the principles that underpin the whole system, drawing on the framework set out in the compliance and regulatory domain of the CASRAI Dictionary.

What the committee is called, and what it does

The body that conducts this review goes by different names in different places. In the United States it is the Institutional Review Board (IRB); in the United Kingdom and much of Europe it is the Research Ethics Committee (REC); in Australia it is the Human Research Ethics Committee (HREC). The names differ but the function is the same: an independent group, including both expert and lay members, that reviews proposed research involving human participants to ensure it is ethically acceptable before it proceeds.

What the committee weighs is consistent across these systems. It assesses whether the risks to participants are reasonable in relation to the anticipated benefits; whether participants will give genuinely informed and voluntary consent; whether the selection of participants is fair; whether privacy and confidentiality are adequately protected; and whether any vulnerable groups involved have additional safeguards. The committee’s independence matters because it is precisely the people closest to a project — its own investigators — who are least able to judge its risks dispassionately.

The tiers of review

One of the most useful things a researcher can understand early is that review is not one-size-fits-all. Most systems operate graded tiers of review scaled to the risk a study poses, and knowing which tier applies sets realistic expectations for time and scrutiny.

- Exempt review is for certain categories of low-risk research (for example, some research using anonymised existing data, or certain educational and survey studies) that meet defined criteria. ‘Exempt’ does not mean no review at all; it usually means the committee, not the investigator, confirms that the exemption applies.
- Expedited review is for research that poses no more than minimal risk and falls within specified categories. It is conducted by one or a few experienced reviewers rather than the full committee, which makes it quicker without lowering the standard for the questions asked.
- Full board review is for research that involves more than minimal risk, vulnerable populations, or sensitive interventions. The whole convened committee considers it, and this is the most thorough, and necessarily the slowest, route.

The single most common cause of frustration is a mismatch of expectation: submitting a higher-risk protocol and expecting an expedited timeline. Identifying the likely tier at the planning stage, and building the corresponding time into the project, prevents most of that friction.

The Declaration of Helsinki and its lineage

None of this arose in a vacuum. The modern ethics-review system rests on a series of foundational documents written in response to historical abuses. The Declaration of Helsinki, developed by the World Medical Association, is the central statement of ethical principles for medical research involving human subjects, and it is periodically revised to keep pace with new challenges. It articulates duties that have become the bedrock of review: the wellbeing of the individual participant takes precedence over the interests of science and society; participation must be voluntary and informed; risks must be minimised and justified; and research must be conducted by suitably qualified people under proper protocols.

Alongside Helsinki sit other touchstones — in the United States, the principles articulated in the Belmont Report (respect for persons, beneficence and justice) and the federal Common Rule that operationalises them. A researcher does not need to memorise these documents, but understanding that the committee’s questions descend from them helps make sense of why it asks what it asks.

Informed consent, done properly

If one element sits at the centre of review, it is informed consent. Consent is not a signature on a form; it is a process by which a potential participant comes to understand what the research involves, what risks and benefits it carries, that participation is voluntary, and that they may withdraw without penalty. Committees scrutinise consent materials closely — for readability, completeness and honesty — and pay particular attention where consent is complicated: research with children, with adults who lack capacity, in emergency settings, or across cultural and language differences. The recurring expectation is that the participant genuinely understands and genuinely chooses, not merely that a box has been ticked.

Working with the process, not against it

Researchers get the most out of ethics review by treating the committee as a collaborator in protecting participants rather than as an obstacle. That means engaging early, before a protocol is locked; writing the application for an intelligent non-specialist, since lay members are part of the point; being candid about risks rather than minimising them, because a committee trusts an application that confronts its own weaknesses; and remembering that review continues after approval, through reporting of adverse events, amendments and, often, continuing review. Recording ethics approvals and their status as structured compliance metadata — alongside other obligations and the recognition of contributors through the CRediT taxonomy — helps keep this information visible across the research record rather than buried in a filing cabinet. The consistent vocabulary for describing ethics review, approval status and the wider compliance landscape is maintained in the CASRAI Dictionary.
June 16, 2026
