Category: Guides & Explainers

Practical how-to guides, templates, checklists, and career pathways for research administrators, authors, and institutional teams.

  • Overfitting and Underfitting in Machine Learning Explained

    Overfitting occurs when a machine learning model learns the noise and quirks of its training data so closely that it performs well on that data but poorly on new, unseen data. Its opposite, underfitting, occurs when a model is too simple to capture the underlying pattern and performs poorly even on the training data. Balancing these two failure modes is one of the central challenges of building reliable, reproducible models.

    The bias-variance trade-off

    Underfitting and overfitting are two sides of the bias-variance trade-off. Bias is error from overly simplistic assumptions; a high-bias model misses real structure and underfits. Variance is error from excessive sensitivity to the training sample; a high-variance model chases noise and overfits. As you make a model more flexible, bias tends to fall and variance tends to rise. The art is to find the sweet spot where expected error on new data, which combines both along with noise that no model can remove, is lowest. A model that generalises well sits between the extremes.
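
    For squared-error loss, the trade-off is usually written as the decomposition below. This is the standard textbook identity rather than something derived in this guide, and the notation is assumed here: f is the true function, \hat{f} is the model fitted to a random training sample, and \sigma^2 is the variance of the noise in y.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

    Only the first two terms respond to model choice, which is why tuning complexity trades one against the other.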

    Aspect | Underfitting | Good fit | Overfitting
    Model complexity | Too low | Appropriate | Too high
    Bias | High | Balanced | Low
    Variance | Low | Balanced | High
    Training accuracy | Poor | Good | Excellent
    Test accuracy | Poor | Good | Poor

    The tell-tale sign of overfitting is a large gap between strong training performance and weak test performance. Underfitting shows up as poor performance on both.
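
    The rule of thumb above can be sketched as a small check. The function name and both thresholds (good_enough, max_gap) are illustrative assumptions, not standard values; sensible cut-offs depend on the task and the metric.

```python
def diagnose_fit(train_score, test_score, good_enough=0.8, max_gap=0.1):
    """Label a model from its training and held-out scores (both in [0, 1]).

    good_enough and max_gap are illustrative thresholds, not standard values.
    """
    if train_score < good_enough:
        return "underfitting"   # weak even on the data it was fitted to
    if train_score - test_score > max_gap:
        return "overfitting"    # strong on training data, much weaker on test
    return "good fit"


print(diagnose_fit(0.99, 0.71))  # overfitting
print(diagnose_fit(0.62, 0.60))  # underfitting
print(diagnose_fit(0.91, 0.88))  # good fit
```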

    Train, validation and test splits

    Diagnosing these problems requires holding data back. The convention is a three-way split: the training set fits the model, the validation set tunes choices such as model complexity and stopping point, and the test set is touched only once, at the end, to estimate real-world performance. Evaluating on data the model trained on always flatters it and hides overfitting. Keeping the test set genuinely untouched is fundamental to honest evaluation, a point we stress across our AI and ML research outputs coverage.
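
    A minimal sketch of a three-way split using only the Python standard library. The function name, the 70/15/15 proportions and the fixed seed are assumptions made for illustration; the guide does not prescribe particular fractions.

```python
import random


def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into train / validation / test lists."""
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # set aside, used once at the end
    val = shuffled[n_test:n_test + n_val]    # used for tuning decisions
    train = shuffled[n_test + n_val:]        # used to fit the model
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

    Recording the seed alongside the results lets someone else recreate exactly the same partition.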

    Regularisation: penalising complexity

    Regularisation discourages a model from becoming too complex by adding a penalty for large or numerous parameters. L1 (lasso) regularisation can shrink some weights to zero, effectively performing feature selection. L2 (ridge) regularisation shrinks weights smoothly towards zero without eliminating them. In neural networks, techniques such as dropout, which randomly disables units during training, and early stopping, which halts training before the model starts memorising, serve the same goal. Each nudges the model towards simpler, more generalisable solutions.
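
    The difference between the two penalties is easiest to see with a single weight and no intercept, where both have closed-form solutions. The sketch below assumes the objectives written in the docstrings (sum of squared errors plus lam times the penalty); other texts scale the penalty differently, so the numbers are illustrative only.

```python
def ridge_weight(xs, ys, lam):
    """Minimise sum((y - w*x)**2) + lam * w**2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_weight(xs, ys, lam):
    """Minimise sum((y - w*x)**2) + lam * abs(w) for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)   # assumed non-zero
    # Soft-thresholding: pull sxy towards zero by lam / 2, stopping at zero.
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


xs = [1.0, 2.0, 3.0]
ys = [0.2, 0.1, 0.3]   # weak relationship between x and y
for lam in (0.0, 1.0, 4.0):
    print(lam, ridge_weight(xs, ys, lam), lasso_weight(xs, ys, lam))
```

    With no penalty the two agree. As lam grows, the ridge weight keeps shrinking but stays above zero, while the lasso weight reaches exactly zero at lam = 4.0, which is the feature-selection behaviour described above.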

    Cross-validation: a more robust check

    A single train-validation split can be lucky or unlucky. Cross-validation guards against this by rotating the validation role across the data. In k-fold cross-validation, the data is divided into k parts; the model trains on k-1 parts and validates on the remaining one, repeating until every part has served as validation once. Averaging the results gives a more stable estimate of how the model will generalise, and a smaller chance of being fooled by a single fortunate split. To learn how these ideas fit into the wider discipline, see our explainer on what machine learning is.
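
    The rotation can be sketched in a few lines of standard-library Python. The function names are hypothetical, and fit_and_score stands in for whatever training and scoring routine you use. The sketch assumes the rows have already been shuffled, because it cuts folds in index order.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each index is validated exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)   # spread the remainder
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate(n_samples, k, fit_and_score):
    """Average the validation score over k folds.

    fit_and_score(train_idx, val_idx) is a caller-supplied function that
    trains on the first index list and returns a score on the second.
    """
    scores = [fit_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


for train_idx, val_idx in k_fold_indices(10, 3):
    print(val_idx)
# [0, 1, 2, 3]
# [4, 5, 6]
# [7, 8, 9]
```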

    Why this threatens reproducible ML

    Overfitting is a leading cause of results that fail to replicate. A model tuned too tightly to one dataset, or evaluated with leakage between training and test data, can report impressive accuracy that collapses when applied elsewhere. Honest splits, regularisation, cross-validation and full reporting of hyperparameters are the defences. We discuss these safeguards in depth in reproducibility of machine learning research, and the consistent terminology to describe them lives in the CASRAI dictionary. As with classical statistics, adequate data matters: too few examples make overfitting almost inevitable, echoing the concerns in our guide to sample size and statistical power.

    Frequently asked questions

    How can I tell if my model is overfitting?

    Compare training and test performance. A model that scores very high on training data but noticeably worse on held-out test data is overfitting. If it performs poorly on both, it is underfitting.

    What is the simplest way to reduce overfitting?

    Gather more representative data, simplify the model, and apply regularisation or early stopping. Cross-validation helps you confirm that your fix genuinely improves generalisation rather than reflecting a lucky split.

    What is the bias-variance trade-off in one sentence?

    It is the tension between a model being too simple to capture the pattern (high bias, underfitting) and so flexible that it captures noise (high variance, overfitting), with the best model balancing the two.

    Why does overfitting harm reproducibility?

    An overfitted model reports performance specific to one dataset that does not carry over to new data, so its results fail to replicate. Honest data splits and transparent reporting, as described in our guidance for authors, are the remedy.

  • Statistical Software in Research: R, SPSS, SAS, Stata and Python Compared

    Statistical software is the family of applications researchers use to manage, analyse and visualise data. The dominant tools in research are R, SPSS, SAS, Stata and the Python data stack. The choice between them shapes not only which analyses are convenient but also how reproducible the work is, because scripted analysis leaves an auditable record that point-and-click work does not.

    The main tools at a glance

    Software | Licence | Typical strength | Reproducibility profile
    R | Open source | Vast statistical and graphics ecosystem | Strong: script-first, scales to literate documents
    Python (pandas/statsmodels) | Open source | General-purpose, data science and ML integration | Strong: script-first, notebooks and pipelines
    Stata | Proprietary | Econometrics, epidemiology, do-files | Strong: do-files capture the full workflow
    SAS | Proprietary | Large datasets, regulated and clinical settings | Strong: script-based; long industry pedigree
    SPSS | Proprietary | Accessible menu-driven analysis | Mixed: improves greatly when syntax is saved

    Scripted analysis and reproducibility

    The single most important property for reproducibility is whether the analysis is captured as code. A script — an R script, a Python file, a Stata do-file or SAS/SPSS syntax — is an exact, re-runnable record of every transformation, model and figure. Re-running it on the same data reproduces the same results, and a reviewer can read it to see precisely what was done. Menu-driven workflows, by contrast, leave no trace of the sequence of clicks unless syntax is deliberately saved. SPSS can be fully reproducible when its underlying syntax is exported and retained, which is the practice we recommend regardless of tool.

    Script-first tools also support literate analysis, in which code, results and narrative live in one document — R Markdown and Quarto in the R and Python worlds, for example. This binds the reported numbers to the code that produced them, closing a common gap between analysis and manuscript.

    Open versus proprietary

    R and Python are free and open source, which lowers cost barriers and lets anyone inspect and re-run an analysis without a licence — a real advantage for reproducibility and for collaborators who lack institutional access. SAS, Stata and SPSS are proprietary, with validated builds, formal support and entrenched roles in regulated and clinical research. The pragmatic point is that all of these are capable, scriptable research tools; reproducibility depends less on which one you choose than on whether you script your analysis, fix your software versions and share your code.

    Citing software and reporting versions

    Software is part of the methods, and it should be reported like any other instrument. Good practice is to:

    • Name the software and version — for example the specific release of R, Stata or SAS, because behaviour and defaults change between versions.
    • List key packages and their versions — an analysis depends on its libraries as much as the base tool.
    • Cite the software using the developer’s recommended citation, and cite influential packages too.
    • Share the analysis code in a repository so the workflow is inspectable and re-runnable.

    Reporting the exact computational environment is what lets others distinguish a genuine replication failure from a version mismatch. For more on transparent methods see our reproducibility coverage, the CASRAI dictionary and our note on handling outliers, where the software’s defaults directly affect what is flagged.
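
    In a Python workflow, the version information listed above can be captured by the analysis script itself. The sketch below uses only the standard library; the function name is hypothetical and the package names are examples, so substitute whatever your analysis imports. R, Stata and SAS have their own equivalents, which are not shown here.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Return lines naming the interpreter, platform and package versions."""
    lines = [f"Python {platform.python_version()} ({sys.platform})"]
    for name in packages:
        try:
            lines.append(f"{name} {metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{name} not installed")
    return lines


# The package names below are examples; list whatever your analysis imports.
print("\n".join(environment_report(["pandas", "statsmodels"])))
```

    Saving this output next to the results gives reviewers the release numbers they need to rule out a version mismatch.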

    Frequently asked questions

    Which statistical software is best for research?

    There is no single best. R and Python excel for flexibility and open reproducibility; Stata is favoured in econometrics and epidemiology; SAS is entrenched in regulated and clinical settings; SPSS is approachable for menu-driven work. The reproducibility-critical choice is to script your analysis whatever the tool.

    Is open-source software acceptable for serious research?

    Yes. R and Python are mainstream research tools used across disciplines and in peer-reviewed work. Their openness is an advantage for reproducibility because anyone can inspect and re-run the code without a licence.

    Why must I report the software version?

    Defaults, algorithms and package behaviour change between releases, so the same code can give slightly different results on different versions. Reporting the version — and key package versions — lets others reproduce your environment and diagnose discrepancies.

    How should I cite the software I used?

    Use the developer’s recommended citation for the base software and cite influential packages, then share your analysis code in a repository. Our author guidance covers reporting computational methods transparently.

  • Web of Science: What It Indexes and How It Works

    Web of Science is a curated, selective citation-indexing platform operated by Clarivate that records scholarly publications and the citation links between them, enabling researchers to trace how ideas connect across the literature. Rather than indexing everything it can find, Web of Science applies editorial selection criteria, and its data underpins widely used research metrics including those published in the Journal Citation Reports.

    This article explains what the Web of Science Core Collection contains, how citation indexing works, its relationship to the Journal Impact Factor, and why a citation index is fundamentally different from a general search engine.

    The Core Collection and selective indexing

    At the heart of Web of Science is the Core Collection, a set of citation indexes covering the sciences, social sciences and arts and humanities, together with conference proceedings and book content. The defining characteristic is selectivity: journals are evaluated against editorial and quality criteria before being accepted, and coverage is curated rather than exhaustive. The intention is that the corpus represents influential, well-edited scholarly literature, so that the citation relationships drawn from it are meaningful.

    This selectivity is the central trade-off of the platform. A narrower, vetted corpus yields cleaner citation data, but it also means many legitimate outputs — particularly in regions, languages or fields with less established journals — may fall outside coverage. Understanding what is and is not indexed is essential before using the data for any kind of assessment.

    The citation index: Garfield’s idea

    The conceptual foundation of Web of Science is the citation index, an idea developed by Eugene Garfield, who founded the Institute for Scientific Information. The insight was simple but powerful: by systematically recording which papers cite which other papers, you create a navigable network of the literature. From any article you can move backwards to the references it cites and forwards to the later papers that cite it.

    This forward-and-backward navigation is what distinguishes a citation index from a bibliographic list. It lets researchers follow the development of an idea over time, identify foundational works, and gauge the influence of a paper by the citations it accrues. The same citation graph is the raw material from which bibliometric indicators are computed.
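
    As a data structure, a citation index is a directed graph stored so that it can be read in both directions. The toy class below illustrates that idea only; it is not how Web of Science is implemented, and the class name, method names and paper labels are invented for the example.

```python
from collections import defaultdict


class CitationIndex:
    """Toy citation index: stores who cites whom, in both directions."""

    def __init__(self):
        self._references = defaultdict(set)   # paper -> papers it cites
        self._cited_by = defaultdict(set)     # paper -> papers that cite it

    def add_citation(self, citing, cited):
        self._references[citing].add(cited)
        self._cited_by[cited].add(citing)

    def backward(self, paper):
        """References: the earlier work this paper builds on."""
        return sorted(self._references.get(paper, set()))

    def forward(self, paper):
        """Citing papers: the later work that builds on this paper."""
        return sorted(self._cited_by.get(paper, set()))

    def times_cited(self, paper):
        return len(self._cited_by.get(paper, set()))


index = CitationIndex()
index.add_citation("B2015", "A2001")
index.add_citation("C2020", "A2001")
index.add_citation("C2020", "B2015")
print(index.backward("C2020"))     # ['A2001', 'B2015']
print(index.forward("A2001"))      # ['B2015', 'C2020']
print(index.times_cited("A2001"))  # 2
```

    Storing both directions is what makes forward navigation cheap: a plain reference list only records the backward links.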

    The Journal Citation Reports and the Impact Factor

    Web of Science citation data feeds the Journal Citation Reports (JCR), Clarivate’s annual analysis of journal-level citation performance. The JCR is the source of the well-known Journal Impact Factor, a journal-level metric calculated from citation counts to a journal’s recent articles. Because the Impact Factor is derived from Web of Science data, a journal must be indexed in the relevant part of the Core Collection to receive one.
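
    This guide does not spell out the calculation. The widely cited two-year form of the Journal Impact Factor for a year Y is sketched below; consult the JCR documentation for the exact rules on what counts as a citable item.

```latex
\mathrm{JIF}_{Y} =
  \frac{\text{citations received in year } Y \text{ by items published in years } Y-1 \text{ and } Y-2}
       {\text{citable items published in years } Y-1 \text{ and } Y-2}
```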

    Element | What it is
    Core Collection | The curated set of citation indexes underpinning the platform
    Citation index | The network of citing–cited relationships between publications
    Journal Citation Reports | Annual journal-level citation analysis built on the data
    Journal Impact Factor | A journal-level metric published within the JCR

    It is important to stress that the Impact Factor is a journal-level measure and is widely cautioned against as a proxy for the quality of any individual article or researcher. Responsible-metrics initiatives encourage using it carefully and in context.

    How it differs from a search engine

    A general web search engine indexes pages it can crawl and ranks them by relevance and popularity signals. Web of Science is different in three respects: its corpus is selected rather than crawled; its core data structure is the citation graph rather than full-text relevance; and its records are structured bibliographic metadata — authors, affiliations, references, funding — rather than raw web content. This makes it a tool for analysis and discovery within the scholarly record, not a general-purpose finder of web pages. Related tools and systems are covered across our research information systems section.

    Web of Science is frequently compared with Elsevier’s Scopus, the other large multidisciplinary citation database; we set the two side by side in our Scopus versus Web of Science comparison. Both rely on persistent identifiers such as the DOI to link records reliably, and definitions of the metrics involved appear in the CASRAI dictionary.

    Frequently asked questions

    Is Web of Science free to use?

    No. Web of Science is a subscription product from Clarivate, typically licensed by universities, research institutions and libraries. Access depends on your organisation’s subscription.

    Does being in Web of Science mean a journal is high quality?

    Inclusion signals that a journal met the platform’s selection criteria, which is a meaningful editorial threshold. It is not, however, an absolute or universal measure of quality, and many reputable journals sit outside its coverage.

    What is the difference between Web of Science and the Journal Citation Reports?

    Web of Science is the underlying citation database; the Journal Citation Reports is an annual analytical product built from that data, and it is where the Journal Impact Factor is published.

    Who invented the citation index?

    The citation-index concept was developed by Eugene Garfield, founder of the Institute for Scientific Information, whose work established the systematic recording of citation links that Web of Science still embodies.

  • Genomic Data-Sharing Standards: GA4GH and Responsible Access Explained

    Genomic data sharing is the responsible exchange of genetic data between researchers and repositories using common standards for file formats, metadata, consent and access control. Because genetic data is sensitive and richly structured, sharing it usefully depends on agreed technical standards and clear governance rather than ad-hoc file transfers.

    This article describes how genetic and genomic data is shared from a data-standards and governance perspective. It is not clinical genetics advice; the focus throughout is notation, metadata, interoperability and access frameworks.

    The Global Alliance for Genomics and Health

    The Global Alliance for Genomics and Health (GA4GH) is an international standards organisation that develops frameworks and technical specifications to enable responsible genomic data sharing. Its work spans both governance — such as consent and data-access policy frameworks — and technical interoperability standards that allow systems to exchange genomic data and query it consistently.

    The value of a shared standards body is that institutions in different countries can align on common interfaces and metadata conventions, so a dataset described and stored according to GA4GH-aligned conventions can be discovered and accessed by authorised researchers elsewhere. Controlled vocabularies underpinning these descriptions are the kind of structured terms recorded in the CASRAI dictionary.

    FAIR principles in a genomics context

    Genomic data sharing is closely aligned with the FAIR principles: data should be findable, accessible, interoperable and reusable. In genomics, “accessible” does not mean open to everyone; it means accessible under clearly defined and machine-readable conditions, which often include authorisation and consent checks.

    FAIR principle | Genomics interpretation
    Findable | Datasets carry persistent identifiers and rich, searchable metadata
    Accessible | Access is defined by clear, often controlled, machine-readable conditions
    Interoperable | Standard formats and shared vocabularies allow systems to exchange data
    Reusable | Consent terms, provenance and licensing are documented for re-analysis

    Consent, controlled access and data archives

    Much genetic data is held in controlled-access archives rather than fully open repositories. Under this model, descriptive metadata may be openly browsable while the underlying genetic data is released only to researchers whose project and credentials have been reviewed and approved by a data-access committee.

    Consent is the cornerstone of this governance. The terms under which data was originally collected determine how it may later be shared and reused, so consent metadata must travel with the data. This makes documented provenance — who collected the data, under what consent, and with what permitted uses — an essential part of responsible sharing.
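
    The two conditions described above, consent that travels with the data and approval by a data-access committee, can be sketched as a simple check. Everything here is hypothetical: the class names, field names, identifier and use-category strings are invented for illustration and do not come from any GA4GH specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    identifier: str             # persistent identifier for the dataset
    description: str            # open, browsable metadata
    permitted_uses: frozenset   # consent terms that travel with the data


@dataclass
class AccessRequest:
    researcher: str
    dataset: Dataset
    intended_use: str
    committee_approved: bool = False   # set by the data-access committee


def browse(dataset):
    """Descriptive metadata is visible without any approval."""
    return {
        "id": dataset.identifier,
        "description": dataset.description,
        "permitted_uses": sorted(dataset.permitted_uses),
    }


def may_release_data(request):
    """Release only if consent covers the use AND the committee approved."""
    consent_ok = request.intended_use in request.dataset.permitted_uses
    return consent_ok and request.committee_approved


cohort = Dataset(
    "example:ds-001",
    "Hypothetical cohort, 500 samples",
    frozenset({"disease-specific research"}),
)
print(may_release_data(AccessRequest("r.patel", cohort, "disease-specific research", True)))  # True
print(may_release_data(AccessRequest("r.patel", cohort, "commercial use", True)))             # False
```

    Keeping permitted_uses on the dataset record itself, rather than in a separate system, is one way to make sure the consent terms cannot be lost when the data moves.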

    File and metadata formats

    Interoperability in genomics rests on standardised file formats for sequence reads and variants, paired with structured metadata describing the sample, the experiment and the access conditions. Consistent formats let independent groups validate, re-align and re-analyse data, supporting the goals discussed across our reproducibility coverage. Persistent identifiers tie datasets to their originating studies and contributors, as explained in our note on persistent identifiers in 2026.

    The same emphasis on stable identifiers and structured notation appears when recording protein information; see our companion guide on amino acids and protein data notation. For broader context, browse our data-infrastructure news and the guidance for authors on describing datasets.

    Frequently asked questions

    What is GA4GH?

    The Global Alliance for Genomics and Health is an international standards organisation that develops governance frameworks and technical specifications to enable responsible genomic data sharing across institutions and borders.

    Does sharing genomic data mean making it openly available to everyone?

    No. Responsible sharing usually means controlled access: descriptive metadata may be browsable, but the underlying genetic data is released only to authorised researchers whose projects and credentials have been reviewed and approved.

    How do FAIR principles apply to genetics data?

    FAIR principles require genetic data to be findable through persistent identifiers and metadata, accessible under clearly defined conditions, interoperable through standard formats, and reusable with documented consent, provenance and licensing.

    Why does consent metadata matter for data sharing?

    Consent determines the permitted uses of data. Because those terms govern future reuse, consent and provenance information must accompany the data so that downstream researchers only use it within the agreed conditions.

  • APA Reference List Format: Worked Examples

    An APA reference list is the alphabetically ordered set of full source entries placed at the end of a document, each formatted with a hanging indent and corresponding to an in-text citation. It follows the author–date conventions of the Publication Manual of the American Psychological Association (7th edition). Every recoverable work cited in the text appears once in the list, and every entry in the list is cited at least once in the text, so the two must match exactly.

    The reference list is where APA’s four-element logic — author, date, title, source — becomes a precise, repeatable format. If you are new to the author–date system, start with our APA 7th edition essentials before building a full list.

    The three formatting rules that govern every entry

    Three mechanical rules apply to the whole list. First, alphabetical order by the first author’s surname; works by the same author are then ordered by year, earliest first. Second, a hanging indent: the first line of each entry sits at the left margin and every subsequent line is indented, so surnames are easy to scan. Third, the list is double-spaced with no extra blank lines between entries, and titled “References”, centred and bold, on a new page.

    Worked examples by source type

    The table below shows a correctly formatted entry for each major source type. Author names and years are illustrative placeholders, but the punctuation, italics and ordering are exactly as APA 7 requires.

    Source type | Worked example
    Journal article | Smith, J. A. (2021). Open-access uptake in clinical trials. Journal of Research Standards, 14(3), 220–238. https://doi.org/10.1000/jrs.2021.0143
    Book | Brown, T. R. (2019). Foundations of research integrity. Academic Press.
    Chapter in an edited book | Lee, S. (2020). Data-sharing norms. In R. Patel (Ed.), Open science in practice (pp. 45–67). University Press.
    Website / web page | Jones, R. B. (2022, March 4). Metadata standards for research outputs. Research Standards Institute. https://example.org/metadata-standards
    Dataset | Patel, A., & Khan, M. (2021). Citation-coverage survey 2021 [Data set]. Open Data Repository. https://doi.org/10.1000/odr.2021.0099

    Reading the journal-article entry

    Take the journal example apart. The author block inverts the name and uses initials. The year sits in brackets. The article title is in sentence case and not italicised — only the first word and proper nouns are capitalised. The journal name and volume number are italicised; the issue number, in brackets, is not. The page range and DOI close the entry, with no full stop after the DOI. This single pattern, with small variations, drives most of the references you will ever write.
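
    The pattern just described is regular enough to express as a template. The sketch below is a hypothetical helper, not an official APA tool: it marks italics with asterisks because plain text cannot show them, expects author names that are already inverted, and ignores the special handling APA applies to very long author lists.

```python
def apa_journal_entry(authors, year, title, journal, volume, issue, pages, doi):
    """Build an APA 7 style journal-article entry as plain text.

    authors: list of "Surname, I. I." strings, already inverted.
    Italics are marked with *asterisks* because plain text cannot show them.
    """
    if len(authors) == 1:
        author_block = authors[0]
    else:
        # Comma-separate the names and put an ampersand before the last one.
        author_block = ", ".join(authors[:-1]) + f", & {authors[-1]}"
    return (
        f"{author_block} ({year}). {title}. "
        f"*{journal}*, *{volume}*({issue}), {pages}. "
        f"https://doi.org/{doi}"   # no full stop after the DOI
    )


print(apa_journal_entry(
    ["Smith, J. A."], 2021, "Open-access uptake in clinical trials",
    "Journal of Research Standards", 14, 3, "220–238", "10.1000/jrs.2021.0143",
))
```

    The title is passed through unchanged, so sentence case remains the caller's responsibility.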

    Handling books, chapters and the publisher rule

    Books reverse the italics: now the title is italicised in sentence case, and the publisher closes the entry. APA 7 dropped the publisher’s city, so “Academic Press” stands alone. For a chapter, you cite the chapter author and chapter title first, then “In”, the editor(s) with initials before the surname, the italicised book title, the page range in brackets, and the publisher. Knowing exactly who is credited at chapter versus volume level matters for fair attribution of credit.

    Websites, datasets and DOI formatting

    Web pages need a specific date where available — year, month and day — and the name of the hosting organisation as the “source”. Datasets are cited as first-class outputs: author, year, italicised title, a bracketed format description such as [Data set], the repository name and a DOI. Treating data this way reflects the modern research-outputs landscape, where datasets, software and protocols are citable on their own terms.

    For DOIs, always use the full https://doi.org/ form, with no trailing punctuation. If an online source has no DOI but has a stable URL, give the URL; if the content is likely to change, add a retrieval date. A persistent identifier is what links your entry to the durable scholarly record.
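
    A small normaliser can enforce the DOI rule when references arrive in mixed forms. This is a sketch under stated assumptions: the function name is hypothetical, the list of prefixes covers only common variants, and stripping trailing punctuation assumes the DOI itself does not end in a full stop, comma or semicolon.

```python
def doi_url(raw):
    """Return a DOI in the full https://doi.org/ form, without trailing punctuation."""
    doi = raw.strip()
    prefixes = (
        "https://doi.org/", "http://doi.org/",
        "https://dx.doi.org/", "http://dx.doi.org/",
        "doi:",
    )
    lowered = doi.lower()
    for prefix in prefixes:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    doi = doi.strip().rstrip(".,;")
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {raw!r}")
    return "https://doi.org/" + doi


print(doi_url("doi:10.1000/jrs.2021.0143."))               # https://doi.org/10.1000/jrs.2021.0143
print(doi_url("http://dx.doi.org/10.1000/odr.2021.0099"))  # https://doi.org/10.1000/odr.2021.0099
```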

    Ordering edge cases

    Two situations trip people up. When one author has several works in the same year, distinguish them with lowercase letters on the year — (2021a), (2021b) — ordered by title, and mirror those letters in the in-text citations. When alphabetising, treat “nothing before something”: Smith, J. comes before Smith, J. A. Single-author entries precede multi-author entries that begin with the same surname.
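
    These ordering rules map neatly onto a sort key. The sketch below assumes a hypothetical entry structure (a dict with authors as (surname, initials) pairs, a numeric year and a title) and fewer than 27 same-year works per author. It compares titles as written, without the refinement of ignoring a leading "A", "An" or "The".

```python
from itertools import groupby
from string import ascii_lowercase


def _author_key(author):
    surname, initials = author
    # Drop dots and spaces so "J." sorts before "J. A." (nothing before something).
    return (surname.lower(), initials.replace(".", "").replace(" ", "").lower())


def _sort_key(entry):
    # A shorter author tuple sorts first, so single-author entries precede
    # multi-author entries that share the same first author.
    authors = tuple(_author_key(a) for a in entry["authors"])
    return (authors, entry["year"], entry["title"].lower())


def order_references(entries):
    """Return entries in list order, adding 'a', 'b', ... to clashing years."""
    ordered = sorted(entries, key=_sort_key)
    result = []
    for _, group in groupby(ordered, key=lambda e: _sort_key(e)[:2]):
        group = list(group)
        for i, entry in enumerate(group):
            suffix = ascii_lowercase[i] if len(group) > 1 else ""
            result.append({**entry, "year_label": f"{entry['year']}{suffix}"})
    return result


refs = [
    {"authors": [("Smith", "J. A.")], "year": 2021, "title": "Zebra"},
    {"authors": [("Smith", "J.")], "year": 2022, "title": "Later work"},
    {"authors": [("Smith", "J. A."), ("Brown", "T. R.")], "year": 2019, "title": "Joint"},
    {"authors": [("Smith", "J. A.")], "year": 2021, "title": "Alpha"},
]
for ref in order_references(refs):
    print(ref["year_label"], ref["title"])
# 2022 Later work
# 2021a Alpha
# 2021b Zebra
# 2019 Joint
```

    The year_label values are the ones to mirror in the in-text citations.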

    Frequently asked questions

    Should every cited source appear in the reference list?

    Yes — with one exception. Standard in-text-only items such as personal communications (emails, interviews not recoverable by a reader) are cited in the text but not listed, because there is nothing the reader can retrieve. Everything recoverable must appear.

    How do I order two works by the same author?

    By year, earliest first. If the years are identical, add lowercase letters to the year and order alphabetically by title. Single-author works always come before that author’s collaborative works.

    Do I keep the hanging indent in a numbered or bulleted list?

    The reference list is never numbered or bulleted in APA. It is a plain, double-spaced list with a hanging indent on each entry. Numbered referencing belongs to other styles, such as Vancouver.

    Where can I confirm an unusual entry?

    For conference papers, theses, software or grey literature, check your institution’s APA guide or the Publication Manual. CASRAI’s author guidance and standards dictionary can help you decide how to describe an output before you format it.

  • What Is Peer Review? Types, Process and Ethical Standards

    Peer review is the process by which scholarly work is evaluated by independent experts in the same field before it is accepted for publication. Its purpose is to assess whether a manuscript is methodologically sound, original and clearly reported, helping editors decide what to publish and helping authors improve their work. Peer review is a cornerstone of research integrity, though it is not infallible and has attracted serious reform efforts.

    At its simplest, peer review answers a question on behalf of readers who cannot check every claim themselves: have qualified experts scrutinised this work and judged it credible enough to enter the scholarly record?

    The peer review process step by step

    While details vary by journal, the core sequence is broadly consistent.

    1. Submission and editorial triage. An editor checks the manuscript for scope, basic quality and adherence to journal policy, and may desk-reject before review.
    2. Reviewer selection. The editor invites independent experts, usually two or more, who have relevant subject knowledge and no disqualifying conflict of interest.
    3. Assessment. Reviewers evaluate the methods, analysis, originality and clarity, and write structured reports with a recommendation.
    4. Decision. The editor weighs the reports and decides: accept, request minor or major revisions, or reject.
    5. Revision and iteration. Authors respond to comments, and the manuscript may go through further rounds before a final decision.

    The editor, not the reviewers, makes the final decision; reviewers advise. Authors preparing for this process can consult our guidance for authors on responding to reviewer reports constructively.

    Models of peer review

    Different journals manage the identities of authors and reviewers in different ways, with consequences for fairness and accountability.

    Model | Who knows whom | Main strength | Main concern
    Single-blind | Reviewers know authors; authors do not know reviewers | Reviewers can comment candidly | Possible bias against authors or institutions
    Double-blind | Neither side knows the other | Reduces identity-based bias | Authors can sometimes be identified from the work
    Open | Identities are disclosed, reports may be published | Accountability and transparency | Reviewers may hesitate to be critical
    Post-publication | Review continues after release | Ongoing scrutiny and correction | Less gatekeeping before publication

    The benefits and open questions of the transparent models are examined further in our article on open peer review models.

    Ethics and COPE

    The Committee on Publication Ethics (COPE) provides widely adopted guidance for editors, reviewers and publishers on handling ethical issues such as conflicts of interest, confidentiality, authorship disputes, plagiarism and research misconduct. Reviewers are generally expected to treat manuscripts as confidential, to declare competing interests, to review only within their competence, and to avoid using privileged information for personal advantage. These norms underpin trust in the system and connect peer review to the wider concerns of our standards dictionary.

    Limitations of peer review

    Peer review is valuable but imperfect. It can be slow, inconsistent between reviewers, and vulnerable to bias relating to gender, geography or institutional prestige. It is generally better at detecting flawed reasoning than deliberate fabrication, and reproducibility problems can pass through undetected. Recognising these limits is essential to using peer review responsibly rather than treating a published paper as automatically correct.

    Reforms: registered reports and transparent review

    Several reforms aim to address these weaknesses. The registered reports format reviews the research question and methodology before data are collected, granting in-principle acceptance on the strength of the design rather than the results. This reduces publication bias against negative findings and discourages questionable research practices. Transparent peer review publishes the review reports and author responses alongside the article, allowing readers to see how conclusions were scrutinised. These approaches reflect a broader movement, discussed across our responsible assessment coverage, towards judging research on its substance and process rather than on prestige alone.

    Frequently asked questions

    Who chooses the reviewers?

    The handling editor selects reviewers, looking for relevant expertise and the absence of conflicts of interest. Authors may sometimes suggest or oppose particular reviewers, but the decision rests with the editor.

    What is the difference between single-blind and double-blind review?

    In single-blind review the reviewers know who the authors are but not vice versa. In double-blind review neither side knows the other’s identity, which is intended to reduce bias linked to author identity, institution or reputation.

    What are registered reports?

    Registered reports are a publishing format in which the study’s rationale and methods are peer reviewed before the research is carried out. If the design is sound the journal commits to publishing the results regardless of outcome, reducing publication bias.

    Does peer review guarantee a paper is correct?

    No. Peer review improves quality and filters out many weak submissions, but it cannot guarantee correctness, detect all misconduct, or ensure reproducibility. It is one safeguard among several in the scholarly record.

  • What Is a Bibliography? Definition, Types and How to Compile

    A bibliography is an organised, usually alphabetised list of sources relevant to a piece of scholarly work, placed at the end of a document. Depending on the convention in use, a bibliography may list only the sources cited or may also include background works consulted but not directly cited. Its purpose is to record the intellectual context of a work and let readers locate every source behind it.

    The word carries more than one meaning in scholarship. In some citation systems “bibliography” is the standard name for the end-of-document source list; in others it is distinguished sharply from a reference list. Understanding which sense applies is the first step to compiling one correctly.

    Bibliography versus reference list

    The clearest way to grasp a bibliography is to set it against the reference list it is often confused with.

    Feature | Reference list | Bibliography
    Contents | Only sources cited in the text | May include cited and uncited background reading
    Mapping to text | One-to-one with in-text citations | Need not map to every in-text marker
    Typical styles | APA, Vancouver (as “References”), MLA (as “Works Cited”) | Chicago notes-bibliography

    A reference list answers the question “what did you cite?” A bibliography can answer the broader question “what shaped this work?” The mapping between in-text markers and entries is covered in in-text citations versus the reference list.

    Types of bibliography

    Enumerative bibliography

    The most common form: a straightforward list of sources, alphabetised by author surname, each entry formatted to a chosen style. This is what most students and researchers mean by “a bibliography”.

    Annotated bibliography

    Each entry is followed by a short paragraph — the annotation — that summarises the source, evaluates its relevance or quality, and notes how it relates to the project. Annotated bibliographies are common in literature reviews and proposals, where the reader benefits from the author’s assessment of each source.

    Analytical and descriptive bibliography

    A specialist scholarly field concerned with books as physical objects — their printing, editions and material history. This sense is distinct from the everyday end-of-paper list and belongs to textual scholarship rather than routine citation.

    How to compile a bibliography

    Compiling a reliable bibliography is a disciplined, repeatable process.

    • Record sources as you read. Capture full bibliographic detail — author, year, title, container, publisher and a persistent identifier such as a DOI — at the moment you consult each source, not afterwards from memory.
    • Choose one citation style and apply it consistently. The required elements are stable, but their order and punctuation are not. See citation styles compared to select the right one.
    • Decide cited-only or cited-plus-background. Confirm whether your style and assignment want a reference list or a fuller bibliography, then include sources accordingly.
    • Alphabetise and format. Order entries by the first author’s surname and apply a hanging indent so each entry is easy to scan.
    • Verify every entry. Check that each persistent identifier resolves and that names are disambiguated — an ORCID iD helps distinguish authors with similar names.

    How to order and format entries

    Most enumerative bibliographies are ordered alphabetically by the lead author’s surname. Where an author has several works, they are usually ordered by year. Numeric systems such as Vancouver are an exception: there the list is ordered by the sequence of first appearance in the text, not alphabetically. Each entry typically uses a hanging indent, and titles, journals and books are styled per the chosen system.

    System | Ordering principle
    Author–date (APA, Chicago author–date) | Alphabetical by surname, then by year
    MLA Works Cited | Alphabetical by first listed name or title
    Numeric (Vancouver) | By order of citation in the text
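
    The two main ordering principles in the table can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the Entry class, its field names and the sample data are hypothetical, and real styles add further tie-breaking rules (for example by title) that are not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """A minimal bibliography entry; field names are illustrative only."""
    surname: str      # lead author's surname
    year: int         # year of publication
    title: str
    first_cited: int  # position of the first in-text citation (1-based)


def order_author_date(entries):
    """Author-date systems: alphabetical by surname, then by year."""
    return sorted(entries, key=lambda e: (e.surname.casefold(), e.year))


def order_numeric(entries):
    """Numeric systems such as Vancouver: order of first citation in the text."""
    return sorted(entries, key=lambda e: e.first_cited)


# Made-up sample data for illustration.
entries = [
    Entry("Smith", 2021, "Placeholder title A", first_cited=1),
    Entry("Brown", 2019, "Placeholder title B", first_cited=3),
    Entry("Smith", 2019, "Placeholder title C", first_cited=2),
]

author_date = [(e.surname, e.year) for e in order_author_date(entries)]
numeric = [(e.surname, e.year) for e in order_numeric(entries)]
```

    The same three entries come out in a different sequence under each principle, which is why the ordering rule has to be settled before the list is formatted.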

    Relationship to works cited and references

    “Works Cited” is MLA’s name for its end-of-paper list and contains only cited sources, making it functionally a reference list rather than a full bibliography. Knowing the vocabulary your discipline uses prevents the common error of mixing background reading into a list that should be cited-only. Sound bibliographies also support research integrity, because a complete, accurate source list lets others verify and build on your work.

    Frequently asked questions

    How do I write a bibliography?

    Record each source’s author, title, date, publisher and a persistent identifier such as a DOI as you read; choose one citation style and apply it consistently; decide whether to include cited-only or background sources; then alphabetise entries by the lead author’s surname with a hanging indent, and verify that every identifier resolves before you submit.

    Is a bibliography the same as a reference list?

    Not always. A reference list contains only the sources you cited. A bibliography may also include background works you read but did not cite. Some styles, however, use “bibliography” as the name for what others call a reference list, so always check your style’s convention.

    What is an annotated bibliography?

    An annotated bibliography adds a short evaluative paragraph after each entry, summarising the source and explaining its relevance. It is common in literature reviews and research proposals where readers benefit from the author’s assessment of each work.

    How do I order a bibliography?

    Most bibliographies are alphabetised by the lead author’s surname, then by year for multiple works by the same author. Numeric systems such as Vancouver are the exception and order entries by their first appearance in the text.

    Where can I find standardised definitions of these terms?

    Consult the CASRAI dictionary for standardised definitions, and our explainer on what a citation is for how individual references fit together.

  • Clinical Trial Phases I to IV: Structure and Governance

    A clinical trial is a prospective study that evaluates the effects of a medical intervention — a medicine, device, procedure or behavioural change — in human participants under a pre-specified protocol. Trials are organised into phases, each answering a different question and building on the evidence of the last. This article describes the structure and governance of trials from a methodology and standards perspective; it is not clinical advice.

    The four phases

    Phase | Primary question | Typical scope
    Phase I | Is it safe, and how is it handled by the body? | First-in-human; small numbers; focus on safety, tolerability and dose-finding.
    Phase II | Does it show signs of working, and at what dose? | Larger groups; preliminary efficacy and further safety.
    Phase III | Does it work better than current options? | Large, often multi-centre randomised controlled trials supporting regulatory approval.
    Phase IV | How does it perform in routine use? | Post-marketing surveillance after approval; rare effects and long-term outcomes.

    How trials are designed

    The strongest designs use randomisation to allocate participants to groups, blinding to reduce expectation bias, and a control group — often a placebo or an existing standard of care — for comparison. A pre-registered protocol specifies the hypotheses, primary and secondary outcomes, sample size and analysis plan before data are collected, which guards against selective reporting. These ideas connect directly to our explainers on the placebo and placebo effect and randomised controlled trials.

    Registration and transparency

    Clinical trials are expected to be registered in a public registry, such as ClinicalTrials.gov or the ISRCTN registry, before they begin. The International Committee of Medical Journal Editors (ICMJE) requires prospective registration as a condition of publication, and the World Health Organization maintains a registry network and a minimum data set. Registration creates a public record of what a trial set out to do, so that its results can be checked against its original aims and so that unpublished trials do not vanish from the evidence base.

    Governance and ethics

    Trials are governed by independent research ethics committees (known in some jurisdictions as institutional review boards), by informed consent from participants, and by adherence to Good Clinical Practice. International principles trace back to the Declaration of Helsinki. Data are monitored, adverse events reported, and the conduct of the trial audited. Reporting of randomised trials is guided by the CONSORT statement, which sets out what the published account should include.

    Why this matters to the research record

    A registered protocol, a transparent results report and a persistent identifier together make a trial part of a citable, auditable record. The same contributor-attribution and identifier infrastructure that CASRAI works on — ORCID for people, DOIs for outputs, registries for studies — is what lets the scholarly and regulatory records stay connected.

    Frequently asked questions

    What are the four phases of a clinical trial?

    Phase I assesses safety and dose in small numbers; Phase II looks for preliminary efficacy; Phase III is a large comparison against existing options to support approval; and Phase IV monitors performance after a product is on the market.

    Why must clinical trials be registered?

    Prospective registration creates a public record of a trial’s aims and design before results are known. It deters selective reporting, reduces publication bias, and is required by the ICMJE as a condition of publishing the results.

    What is the difference between a clinical trial and clinical research?

    A clinical trial is one type of clinical research in which an intervention is tested under a protocol. Clinical research is the broader field, which also includes observational studies that do not assign an intervention.

    What governs the ethics of a clinical trial?

    Independent ethics committees, informed consent, the Declaration of Helsinki and Good Clinical Practice together govern trial ethics. See our Good Clinical Practice explainer.

  • APA Referencing Style Essentials (7th Edition)

    APA format is the author–date referencing style of the American Psychological Association, set out in the Publication Manual of the American Psychological Association (7th edition, 2020). It pairs a brief in-text citation — author surname and year — with a full, alphabetically ordered entry in a reference list at the end of the document. APA is the dominant style across psychology, education, nursing and the social and behavioural sciences.

    Because APA is built on the author–date principle, every claim that draws on a source carries a signal the reader can resolve immediately: who said it, and when. The year matters because evidence in these disciplines ages, and recency is part of how readers judge relevance. To understand where APA sits among the major systems, it helps to read it alongside our overview of how APA, MLA, Chicago and Vancouver compare.

    How APA in-text citation works

    APA in-text citations name the author and the year, and add a page or paragraph number for direct quotations. Two formats exist. A parenthetical citation places everything in brackets: (Smith, 2021). A narrative citation weaves the author into the sentence and brackets only the year: Smith (2021) argued that… For a direct quote, add a locator: (Smith, 2021, p. 14).

    Works by two authors name both every time, joined by an ampersand inside brackets — (Smith & Jones, 2020) — or by “and” in narrative form. Works by three or more authors use “et al.” from the first mention: (Smith et al., 2019). This shortening was one of the headline changes in the 7th edition.
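
    The author-count rules above can be expressed as a short Python sketch. The function names and parameters are hypothetical, and the sketch covers only the cases described here: one, two, or three or more authors, with an optional page locator in the parenthetical form.

```python
def _names(surnames, joiner):
    """Apply the one / two / three-or-more author rule."""
    if not surnames:
        raise ValueError("at least one author surname is required")
    if len(surnames) == 1:
        return surnames[0]
    if len(surnames) == 2:
        return f"{surnames[0]} {joiner} {surnames[1]}"
    return f"{surnames[0]} et al."


def parenthetical(surnames, year, page=None):
    """Everything inside brackets; two authors are joined by an ampersand."""
    locator = f", p. {page}" if page is not None else ""
    return f"({_names(surnames, '&')}, {year}{locator})"


def narrative(surnames, year):
    """Author in the sentence, year in brackets; two authors joined by 'and'."""
    return f"{_names(surnames, 'and')} ({year})"
```

    For example, parenthetical(["Smith", "Jones"], 2020) gives (Smith & Jones, 2020), and narrative(["Smith"], 2021) gives Smith (2021).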

    Anatomy of an APA reference entry

    Every full reference answers four questions in a fixed order: Who (author), When (date), What (title), and Where (source). A journal-article entry illustrates the pattern:

    Smith, J. A., & Jones, R. B. (2021). Measuring open-access uptake in clinical research. Journal of Research Standards, 14(3), 220–238. https://doi.org/10.1000/jrs.2021.0143

    Element | Example | Rule
    Author | Smith, J. A., & Jones, R. B. | Surname, then initials; invert all authors; ampersand before the last
    Date | (2021). | Year of publication in brackets
    Title | Measuring open-access uptake in clinical research. | Sentence case; article titles not italicised
    Source | Journal of Research Standards, 14(3), 220–238. | Journal name and volume italicised; issue in brackets; page range
    DOI | https://doi.org/10.1000/jrs.2021.0143 | Presented as a full clickable URL
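
    As a rough sketch, the Who, When, What, Where order can be assembled programmatically. The function names and parameters below are hypothetical, and plain text cannot show the italics that APA applies to the journal name and volume. The sketch also assumes a title that ends in a full stop and an author list short enough not to need truncation.

```python
def format_authors(authors):
    """Invert every author (surname, then initials); ampersand before the last.

    `authors` is a list of (surname, initials) pairs, e.g. ("Smith", "J. A.").
    """
    if not authors:
        raise ValueError("at least one author is required")
    inverted = [f"{surname}, {initials}" for surname, initials in authors]
    if len(inverted) == 1:
        return inverted[0]
    return ", ".join(inverted[:-1]) + ", & " + inverted[-1]


def journal_reference(authors, year, title, journal, volume, issue, pages, doi):
    """Assemble author, date, title and source in APA order.

    No full stop is added after the DOI link.
    """
    return (
        f"{format_authors(authors)} ({year}). {title}. "
        f"{journal}, {volume}({issue}), {pages}. https://doi.org/{doi}"
    )


# Rebuilds the example entry shown above.
entry = journal_reference(
    authors=[("Smith", "J. A."), ("Jones", "R. B.")],
    year=2021,
    title="Measuring open-access uptake in clinical research",
    journal="Journal of Research Standards",
    volume=14,
    issue=3,
    pages="220–238",
    doi="10.1000/jrs.2021.0143",
)
```

    In practice a reference manager handles this formatting; the point of the sketch is only that the element order is fixed and mechanical.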

    Authorship order in the reference list is not cosmetic — it carries credit. The conventions for who appears, and in what order, connect directly to broader debates about contribution and credit and the standards around authorship that CASRAI documents.

    Common source types

    The four-part skeleton flexes to fit different materials. A book gives author, year, italicised title in sentence case, and publisher: Brown, T. (2019). Foundations of research integrity. Academic Press. A chapter in an edited book adds the editors and book title: Lee, S. (2020). Data-sharing norms. In R. Patel (Ed.), Open science in practice (pp. 45–67). University Press. A web page gives author, date, italicised title and the site, then the URL. A dataset is treated as a recoverable output with author, year, title, a bracketed description such as [Data set], the repository, and a DOI.

    DOIs as URLs

    One of the clearest shifts in APA 7 is DOI formatting. A digital object identifier is now always presented as a full https://doi.org/ URL rather than with the older “doi:” prefix. No full stop follows the DOI or URL, because trailing punctuation can break a link. Include the DOI for every work that has one, whether you consulted it online or in print. The DOI is the source’s persistent address, closely related to the role of a stable identifier in the wider scholarly record.
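
    Older reference lists often mix “doi:” prefixes, legacy resolver addresses and bare identifiers. A minimal Python sketch of normalising them to the https://doi.org/ form might look like the following. The function name is hypothetical, and stripping a trailing full stop is a heuristic assumption for pasted text, since a DOI suffix could in principle end in one.

```python
import re

# Leading forms that are removed before the DOI is re-expressed as a URL.
_LEADING = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def doi_to_url(raw):
    """Return the DOI as a full https://doi.org/ link with no trailing stop."""
    doi = _LEADING.sub("", raw.strip())
    doi = doi.rstrip(".")  # heuristic: drop sentence punctuation after the DOI
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a recognisable DOI: {raw!r}")
    return f"https://doi.org/{doi}"
```

    For instance, doi_to_url("doi:10.1000/jrs.2021.0143") returns https://doi.org/10.1000/jrs.2021.0143.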

    What changed in the 7th edition

    The 7th edition (2020) made several practical changes. Publisher locations were dropped from book references. The “et al.” rule now applies from the first citation for three or more authors, and the reference list may name up to 20 authors before truncating. The phrase “Retrieved from” before URLs was removed unless a retrieval date is genuinely needed. Singular “they” is endorsed as an inclusive pronoun. And the manual added explicit, format-specific guidance for student papers versus professional manuscripts.

    Frequently asked questions

    Do I need a page number for every APA citation?

    No. A page or paragraph number is required only for direct quotations and is recommended when you point to a specific passage. Paraphrased material needs author and year but no locator, though giving one is courteous when paraphrasing from a long work.

    How do I cite a source with no author?

    Move the title to the author position. For an in-text citation, use the first few words of the title in italics or quotation marks, matching how the work is formatted in the reference list, followed by the year. Use “n.d.” for no date.

    Is APA the same as Harvard referencing?

    They share the author–date family resemblance, but they are not identical. Harvard is a style family with many institutional variants, whereas APA is a single, centrally published standard with precise rules. Always follow the specific guide your publisher or institution names.

    Where can I check the correct entry for an unusual source?

    Consult the Publication Manual directly, or your institution’s APA guide, for materials such as conference papers, theses, software and social media. CASRAI’s guidance for authors and our research-standards dictionary can help you reason about how an unfamiliar output should be described and credited.

  • What Is a DOI? The Handle System and DOI Resolution Explained

    A Digital Object Identifier (DOI) is a persistent, globally unique character string that identifies a digital object — most often a journal article, dataset, book or other research output — and reliably resolves to that object’s current location on the web. Unlike a plain URL, a DOI is designed to keep working even when the underlying web address changes, because the identifier points to a record that the owner keeps up to date rather than to a fixed server path.

    DOIs are governed by ISO 26324, the international standard that defines DOI syntax and the rules of the DOI system, and they are managed at the apex by the International DOI Foundation (IDF). This article explains what a DOI is, how it is structured, how resolution works through the Handle System, and which organisations assign DOIs in scholarly publishing.

    The structure of a DOI: prefix and suffix

    Every DOI has two parts separated by a forward slash. The prefix always begins with 10. followed by a registrant code identifying the organisation that registered the DOI (for example 10.1000). The suffix, chosen by the registrant, identifies the specific item; it can be any string, provided it is unique within that prefix.

    Component | Example | Meaning
    DOI prefix | 10.1000 | Directory indicator (10) plus registrant code
    DOI suffix | 182 | Registrant-assigned identifier for the object
    Full DOI | 10.1000/182 | The complete, opaque identifier
    Resolvable form | https://doi.org/10.1000/182 | The DOI expressed as a clickable link

    The DOI itself is deliberately opaque: the characters carry no built-in meaning about the content, the publisher or the year. This opacity is a feature, not a flaw — it means a DOI never has to change because something about the object changed. The recommended way to display a DOI is as a full HTTPS link using the https://doi.org/ proxy, so that readers can simply click it.
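
    The prefix and suffix structure is simple enough to take apart in code. The following Python sketch is illustrative only: the DoiParts type and split_doi function are hypothetical names, and the check is deliberately loose rather than a full validation against ISO 26324.

```python
from typing import NamedTuple


class DoiParts(NamedTuple):
    directory: str   # the directory indicator, always "10"
    registrant: str  # registrant code, e.g. "1000"
    suffix: str      # registrant-assigned, opaque string


def split_doi(doi):
    """Split a bare DOI such as 10.1000/182 into its components."""
    # The first forward slash separates prefix from suffix; the suffix
    # itself may contain further slashes.
    prefix, slash, suffix = doi.partition("/")
    if not slash or not suffix:
        raise ValueError(f"missing suffix in {doi!r}")
    directory, dot, registrant = prefix.partition(".")
    if directory != "10" or not dot or not registrant:
        raise ValueError(f"prefix must look like 10.<registrant>: {doi!r}")
    return DoiParts(directory, registrant, suffix)


parts = split_doi("10.1000/182")
```

    Because the suffix is opaque, the sketch returns it untouched and makes no attempt to read meaning into it.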

    How DOI resolution works: the Handle System

    The technical machinery beneath every DOI is the Handle System, a distributed identifier-resolution infrastructure developed by the Corporation for National Research Initiatives. A DOI is in fact a Handle within a specific namespace, and DOI resolution is the process of looking up the identifier and returning the current data associated with it — principally the URL where the object now lives.

    When you click https://doi.org/10.1000/182, the request reaches the DOI proxy server, which queries the Handle System for that DOI’s record. The record contains the up-to-date target URL, and the resolver redirects your browser there. Because publishers update the target URL when content moves, the DOI keeps resolving even after the destination has been reorganised — this is the core of persistence.
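
    The indirection described here can be pictured with a toy in-memory model. This Python sketch is not the Handle System protocol, which is distributed and far richer. The ToyResolver class and the example.org addresses are hypothetical, used only to show why a citation that holds the DOI survives a change of location.

```python
class ToyResolver:
    """A single-table stand-in for the identifier-to-URL lookup."""

    def __init__(self):
        self._records = {}  # DOI -> current target URL

    def register(self, doi, url):
        """Create the record when the DOI is first registered."""
        self._records[doi] = url

    def update_target(self, doi, new_url):
        """The registrant moves the content and updates the record."""
        if doi not in self._records:
            raise KeyError(doi)
        self._records[doi] = new_url

    def resolve(self, doi):
        """Return the current location; a real proxy would redirect to it."""
        return self._records[doi]


resolver = ToyResolver()
resolver.register("10.1000/182", "https://old.example.org/articles/182")
before = resolver.resolve("10.1000/182")

# The publisher reorganises its site; the DOI in every citation is unchanged.
resolver.update_target("10.1000/182", "https://new.example.org/content/182")
after = resolver.resolve("10.1000/182")
```

    Only the record changes between the two lookups; the identifier that readers cite stays the same.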

    Persistence versus ordinary URLs

    An ordinary web link breaks — the familiar “404 Not Found” — whenever a page is moved, a domain is retired or a site is restructured. This phenomenon, known as link rot, is corrosive to the scholarly record, which depends on being able to cite and re-find sources years or decades later. A DOI mitigates this by adding a layer of indirection: citations point at the stable identifier, and the identifier’s owner maintains the mapping to wherever the content actually resides. The DOI is part of a wider family of persistent identifiers (PIDs) explored across our persistent-identifiers coverage.

    Who assigns DOIs?

    The IDF does not register individual DOIs itself; instead it appoints DOI registration agencies that serve particular communities. In scholarly publishing the two largest are Crossref, which registers DOIs for journal articles, conference papers, books and other text-based scholarly content, and DataCite, which focuses on research datasets and other non-traditional outputs. Each agency collects descriptive metadata alongside the DOI and operates the services that make DOIs useful for discovery and citation. We examine that division of labour in our piece on Crossref and the DOI registration agencies.

    DOIs also coexist with other identifiers in the modern research infrastructure — ORCID for people, ROR for organisations, RAiD for projects — described in our overview of ORCID, ROR, RAiD and the DOI in 2026. For definitions of these and related terms, the CASRAI dictionary is a useful reference.

    Versioning and DOIs

    Because a DOI is permanent, an updated version of a dataset or preprint is usually given its own DOI, with a separate “concept” DOI that always points to the latest version. This pattern is explained in our article on concept and version DOIs.
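
    A minimal Python sketch of the concept-and-version pattern follows. The VersionedRecord class is hypothetical and the DOIs are made up; the readable “.v1” suffixes are there only to make the example easy to follow, since real suffixes are opaque.

```python
class VersionedRecord:
    """One concept DOI plus an ordered list of version DOIs."""

    def __init__(self, concept_doi):
        self.concept_doi = concept_doi
        self.version_dois = []  # oldest first

    def publish(self, version_doi):
        """Each new version receives its own permanent DOI."""
        self.version_dois.append(version_doi)

    def resolve(self, doi):
        """Concept DOI -> latest version; version DOI -> that exact version."""
        if doi == self.concept_doi:
            if not self.version_dois:
                raise LookupError("no versions published yet")
            return self.version_dois[-1]
        if doi in self.version_dois:
            return doi
        raise KeyError(doi)


record = VersionedRecord("10.1000/dataset")
record.publish("10.1000/dataset.v1")
record.publish("10.1000/dataset.v2")

latest = record.resolve("10.1000/dataset")
pinned = record.resolve("10.1000/dataset.v1")
```

    Citing the version DOI pins the exact data used, while citing the concept DOI follows the work as it evolves.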

    Frequently asked questions

    Is a DOI the same as a URL?

    No. A DOI is an identifier, not a location. It is usually expressed as a URL via the https://doi.org/ proxy so it can be clicked, but the identifier itself is the part after the proxy. The URL it points to can change; the DOI does not.

    What standard defines DOIs?

    DOI syntax and the rules of the DOI system are defined by ISO 26324. The system is administered by the International DOI Foundation, and resolution is provided by the Handle System.

    Can a DOI ever stop working?

    A DOI continues to resolve as long as its registration agency and the registrant maintain the record. The persistence guarantee is a social and contractual commitment as well as a technical one: it depends on publishers updating target URLs and on the agencies remaining operational.

    How do I cite a DOI correctly?

    Best practice is to present the DOI as a full HTTPS link, for example https://doi.org/10.1000/182, so that it is both human-readable and machine-actionable. Guidance for authors is collected on our for-authors page.