Category: Guides & Explainers

Practical how-to guides, templates, checklists, and career pathways for research administrators, authors, and institutional teams.

  • The Dunning-Kruger Effect: The Study and Its Measurement Critique

    The Dunning-Kruger effect is the finding that people with low ability in a domain tend to overestimate their competence, while high performers are comparatively more accurate or even modestly underestimate themselves. It was introduced by Justin Kruger and David Dunning in a 1999 paper. The effect is widely cited, but it is also the subject of serious measurement-science debate about how much of the pattern reflects a real metacognitive deficit versus a statistical artefact.

    What the original study showed

    Kruger and Dunning (1999) tested participants on tasks such as logical reasoning, grammar and judging humour, then asked them to estimate both their raw score and their percentile rank relative to peers. Plotting self-assessment against actual performance produced the now-famous picture: those in the bottom quartile rated themselves far above their true standing, while top performers gave more modest estimates. The authors interpreted this as a problem of metacognition—the same lack of skill that produces poor performance also impairs the ability to recognise that performance is poor.

    How the effect is usually visualised

    Performance group | Typical actual percentile | Typical self-estimate
    Bottom quartile | Low | Substantially above actual
    Middle quartiles | Moderate | Near actual, mildly inflated
    Top quartile | High | Close to or slightly below actual

    The gap between the self-estimate column and the actual column is what the popular account calls the effect.

    The regression-to-the-mean critique

    The most important methodological objection is regression to the mean. Self-assessments are imperfect and noisy. Whenever two variables are imperfectly correlated, extreme scores on one tend to be paired with less extreme scores on the other. So the lowest performers, simply by being extreme, will on average have self-estimates closer to the middle—looking like overestimation—while the highest performers’ estimates regress downward, looking like underestimation. Critics argue that part of the classic graph would appear even if everyone judged themselves with the same modest, unbiased error.
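    The regression argument can be checked with a small simulation. The sketch below is illustrative only and is not the analysis used in any published critique: it assumes a toy model in which skill is normally distributed and both the test score and the self-estimate equal skill plus independent, zero-mean noise, so nobody is biased. The function names, the noise level and the sample size are all hypothetical choices.

```python
import random
import statistics


def percentile_ranks(values):
    """Return each value's percentile rank in the range [0, 100)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = 100.0 * position / len(values)
    return ranks


def simulate_quartiles(n=20000, noise_sd=1.0, seed=0):
    """Mean actual and self-estimated percentile per performance quartile.

    Assumed toy model: skill ~ N(0, 1); the test score and the self-estimate
    are each skill plus independent, zero-mean noise. Nobody is biased.
    """
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n)]
    score = [s + rng.gauss(0, noise_sd) for s in skill]
    estimate = [s + rng.gauss(0, noise_sd) for s in skill]

    actual_pct = percentile_ranks(score)
    self_pct = percentile_ranks(estimate)

    rows = []
    for q in range(4):  # group people by quartile of actual performance
        members = [i for i in range(n) if 25 * q <= actual_pct[i] < 25 * (q + 1)]
        rows.append((
            statistics.fmean(actual_pct[i] for i in members),
            statistics.fmean(self_pct[i] for i in members),
        ))
    return rows
```

    Calling simulate_quartiles() returns four (actual, self-estimate) pairs. Under these assumptions the bottom quartile's mean self-estimate sits well above its mean actual percentile and the top quartile's sits well below, even though every simulated person judges themselves without bias.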

    The better-than-average effect

    A second contributor is the better-than-average effect: across many domains most people rate themselves as above the median. If nearly everyone places themselves near, say, the 60th–70th percentile regardless of skill, then by arithmetic the genuinely low performers must be overestimating and the genuinely high performers underestimating. Some of the Dunning-Kruger pattern can therefore be reconstructed from a general self-enhancement tendency plus the statistics of ranking.
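    The arithmetic can be made explicit. As a minimal sketch, assume everyone reports the same percentile (65 is an arbitrary value inside the 60th to 70th range mentioned above) and take the midpoint of each quartile as its typical actual percentile. Both simplifications are assumptions made for illustration.

```python
def estimation_gaps(flat_estimate=65.0):
    """Self-estimate minus actual percentile, per quartile, when everyone gives the same estimate."""
    quartile_midpoints = [12.5, 37.5, 62.5, 87.5]  # typical actual percentile in each quartile
    return [flat_estimate - actual for actual in quartile_midpoints]


gaps = estimation_gaps(65.0)  # [52.5, 27.5, 2.5, -22.5]
```

    The bottom quartile appears to overestimate by more than 50 percentile points and the top quartile to underestimate by about 22, although in this toy case the self-estimate carries no information about skill at all.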

    The double-burden hypothesis

    Kruger and Dunning’s psychological explanation was a “double burden”: the competences required to do well on a task are often the same competences required to judge one’s own performance on it. A person with a weak grasp of grammar, for instance, lacks the knowledge to spot their own grammatical errors, so they cannot accurately rate their grammar. On this account, incompetence is doubly costly—it produces poor results and conceals them from the performer. The original studies offered some support by showing that training low performers improved both their skill and their self-assessment, which a purely statistical account does not obviously predict. The authors also noted an apparent asymmetry: top performers tended to underestimate their relative standing, which they attributed to a “false-consensus” assumption that tasks they found easy were easy for everyone. Whether that asymmetry is a genuine psychological phenomenon or a further reflection of the statistics of ranking remains part of the ongoing debate.

    Why the measurement debate matters

    The Dunning-Kruger discussion is valued in methodology teaching precisely because it shows how a robust-looking pattern can have multiple explanations. The same graph is consistent with a genuine metacognitive deficit, with regression to the mean, and with the better-than-average effect—and disentangling them requires careful design rather than eyeballing a chart. Analyses that simulate purely random noise can reproduce a strikingly similar figure, which is sobering. Yet other work that models the components separately still finds a residual effect that noise alone cannot account for. The honest position is that the strong, viral version is overstated while a weaker, real phenomenon may remain, and that the size of any genuine effect should be reported with its uncertainty rather than asserted as a fixed law.

    A nuanced reading of the evidence

    These critiques do not show that the effect is purely an illusion. Researchers continue to debate how much residual metacognitive deficit remains after accounting for regression and the better-than-average tendency, and some analyses find a real, if smaller, component. The responsible conclusion is conditional: the headline graph overstates a clean psychological law, yet the underlying observation—that the unskilled often lack the very knowledge needed to gauge their own gaps—retains some support. This is a textbook example of why effects must be evaluated against reproducibility standards, and against the reliability and validity of the self-report measures involved, rather than accepted from a memorable chart alone.

    How the popular version diverges from the science

    The phrase “Dunning-Kruger effect” has taken on a life of its own online, often shrunk to the claim that “stupid people are too stupid to know they are stupid” or paired with an invented graph showing a confident “peak” early in learning that the original papers never reported. Neither caricature reflects the published research. Kruger and Dunning’s data were about average tendencies across performance quartiles, not a universal law applying to every individual, and they did not describe a confidence curve rising and falling with expertise. This gap between the meme and the method is itself instructive: a finding can become more certain in popular retelling even as the scientific picture grows more cautious. Treating the viral version as established fact is exactly the kind of error that careful sourcing and clear definitions, such as those maintained in a research dictionary, are meant to prevent.

    Why this matters for assessment

    The episode is a caution for anyone relying on self-rated competence. Self-assessment is a weak proxy for ability, and instruments that ask people to rank themselves inherit the same statistical traps. Sound, responsible assessment pairs self-report with objective measures and reports the reliability and validity of both, rather than treating a vivid effect as settled fact. Clear terminology, as catalogued in a research dictionary, helps prevent a contested finding from hardening into a slogan.

    Frequently asked questions

    Do high performers underestimate their abilities under the Dunning-Kruger effect?

    Yes, modestly. In Kruger and Dunning’s 1999 study, participants in the top performance quartile rated their own rank slightly below their true standing, while bottom-quartile participants substantially overestimated theirs. Critics note that regression to the mean and the better-than-average effect can reproduce part of this pattern without any genuine metacognitive deficit.

    What is the Dunning-Kruger effect in simple terms?

    It is the tendency for people with low ability in an area to overestimate their competence, partly because the skills needed to perform well are also needed to recognise poor performance.

    Who discovered the Dunning-Kruger effect?

    Justin Kruger and David Dunning described it in a 1999 paper based on experiments in reasoning, grammar and humour, where participants estimated their own rank against peers.

    Is the Dunning-Kruger effect real or a statistical artefact?

    It is partly contested. Regression to the mean and the better-than-average effect can reproduce much of the famous graph, so the strong version is overstated, though some researchers still find a residual metacognitive component.

    What is regression to the mean?

    When two variables are imperfectly correlated, extreme scores on one tend to pair with less extreme scores on the other. This alone can make low scorers look like overestimators and high scorers like underestimators.

  • What Is Research? Meaning, Types and the Research Lifecycle

    Research is a systematic process of investigation undertaken to discover new knowledge, confirm or revise existing understanding, and answer questions that have not yet been adequately resolved. What separates research from casual enquiry is its systematic character: it follows a planned, transparent method, gathers evidence deliberately, and subjects its conclusions to scrutiny. The result is intended to be reliable knowledge that others can examine, build upon and, ideally, reproduce.

    Basic and applied research

    Research is commonly divided into two broad orientations. Basic research, sometimes called fundamental or pure research, seeks to expand understanding for its own sake, without a specific application in mind. Investigating how a protein folds or why a mathematical relationship holds is basic research. Applied research addresses a particular practical problem: developing a treatment, improving a manufacturing process or evaluating a policy. The two are not rivals but a continuum, and basic findings frequently enable later applied advances.

    Type | Primary aim | Example
    Basic research | Advance fundamental understanding | Studying the mechanism of cell division
    Applied research | Solve a defined practical problem | Testing a new drug to prevent a disease

    The research lifecycle

    Most research, whatever its field, moves through a recognisable sequence of stages often described as the research lifecycle. While disciplines differ in detail, the lifecycle gives a shared vocabulary for the work and for the data and outputs it produces.

    Stage | What happens
    Question | Identify a gap and frame a clear, answerable research question or hypothesis
    Design | Choose methods, plan sampling and analysis, address ethics and feasibility
    Data | Collect, manage and document data according to the plan
    Analysis | Interpret the evidence using appropriate methods and statistics
    Dissemination | Report findings through publications, datasets and other shared outputs

    The lifecycle is iterative rather than strictly linear. Analysis often raises new questions, and dissemination feeds the next cycle of enquiry. Crucially, each stage generates information (about methods, samples, instruments and results) that needs to be described consistently so others can understand and reuse it.

    Analysis and the role of statistics

    The analysis stage is where evidence becomes findings. In quantitative research this typically draws on statistics, using descriptive summaries to characterise data and inferential methods to generalise responsibly. Careful analysis distinguishes signal from noise, reports uncertainty honestly through measures such as confidence intervals, and resists over-interpreting chance patterns. Weak analysis is a recognised threat to the trustworthiness of the resulting knowledge.

    Reproducibility and the scholarly record

    Research only contributes durable knowledge if its claims can be checked. Reproducibility, the ability of others to obtain consistent results using the same data and methods, depends on transparent reporting of every lifecycle stage. The scholarly record, the accumulated and citable body of publications, datasets and metadata, is the lasting product of research. CASRAI’s mission is to standardise the terminology used to describe research activities and outputs, which directly supports clearer reporting and reuse. Explore the CASRAI dictionary, the research lifecycle category and the author guidance for related resources.

    Frequently asked questions

    What makes an activity count as research?

    Research is distinguished by being systematic, methodical and aimed at producing generalisable or transferable knowledge. A planned investigation with documented methods and conclusions open to scrutiny qualifies; an unstructured opinion does not.

    Is the research lifecycle the same in every field?

    The broad stages (question, design, data, analysis and dissemination) are common across disciplines, but the methods within each stage vary widely. A laboratory experiment, a clinical trial and an archival history study share the lifecycle shape while differing in technique.

    How does CASRAI relate to research?

    CASRAI develops shared, standardised vocabularies for describing the people, activities and outputs of research. Consistent terminology across the lifecycle makes outputs easier to find, compare, reuse and reproduce, strengthening the scholarly record as a whole.

  • Cohort and Case-Control Study Designs

    A cohort study follows groups defined by their exposure forward to see who develops an outcome, while a case-control study starts from the outcome and looks back at exposure. Both are observational designs — the researcher observes rather than assigns exposure — and each answers a question the other cannot answer efficiently.

    This methodology guide explains the two designs, their strengths and weaknesses, and the high-level difference between relative risk and the odds ratio. It is methodological in scope and not medical advice.

    Cohort studies: exposure first, outcome later

    A cohort study groups participants by exposure status and tracks them over time to compare how often the outcome occurs in each group. Two timings exist:

    • Prospective — exposure is recorded now and the cohort is followed forward into the future.
    • Retrospective — the researcher uses existing records to reconstruct exposure in the past and trace outcomes that have already occurred.

    Because exposure is established before the outcome is known, cohort designs are well suited to establishing temporal sequence and to studying multiple outcomes from a single exposure.

    Case-control studies: outcome first, exposure looked back

    A case-control study begins with the outcome: it assembles cases (those with the condition) and controls (comparable individuals without it), then looks back to compare how often each group was exposed. This makes case-control designs efficient for rare outcomes and for situations where a long follow-up would be impractical.

    Side-by-side comparison

    Feature | Cohort | Case-control
    Starting point | Exposure | Outcome
    Direction | Exposure → outcome | Outcome → exposure (look back)
    Good for rare outcomes | Inefficient | Efficient
    Good for rare exposures | Efficient | Inefficient
    Multiple outcomes | Yes, from one exposure | No, single outcome
    Headline measure | Relative risk | Odds ratio
    Main weakness | Cost, loss to follow-up | Recall and selection bias

    Relative risk versus odds ratio, conceptually

    The two designs naturally produce different effect measures. A cohort study can compute relative risk — the ratio of the probability of the outcome in the exposed group to that in the unexposed group — because it knows how many people in each group went on to develop the outcome. A case-control study cannot compute that directly, because the researcher chose how many cases and controls to recruit; it instead reports the odds ratio, which compares the odds of exposure between cases and controls. When the outcome is rare, the odds ratio approximates the relative risk closely; as the outcome becomes common, the two diverge. This is a conceptual sketch, not a formula to apply clinically.
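    The convergence and divergence of the two measures can be seen with a pair of 2x2 tables. The counts below are invented for illustration, and the function names are hypothetical. Note that the cross-product a*d / (b*c) is the same number whether it is read as the odds of the outcome by exposure or the odds of exposure by outcome, which is why a case-control study can estimate it.

```python
def relative_risk(a, b, c, d):
    """Risk in the exposed divided by risk in the unexposed.

    a = exposed with outcome, b = exposed without outcome,
    c = unexposed with outcome, d = unexposed without outcome.
    """
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Cross-product ratio of the same 2x2 table."""
    return (a * d) / (b * c)


# Rare outcome: 2% of the exposed and 1% of the unexposed develop it.
rare = (20, 980, 10, 990)
# Common outcome: 40% of the exposed and 20% of the unexposed develop it.
common = (400, 600, 200, 800)

rare_rr, rare_or = relative_risk(*rare), odds_ratio(*rare)          # 2.0 and about 2.02
common_rr, common_or = relative_risk(*common), odds_ratio(*common)  # 2.0 and about 2.67
```

    In both tables the relative risk is 2.0, but the odds ratio stays close to it only when the outcome is rare.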

    Strengths, weaknesses and bias

    Cohort studies give clear temporal ordering and can study several outcomes, but they are expensive, slow for rare outcomes, and vulnerable to participants dropping out. Case-control studies are quick and efficient for rare outcomes, but are prone to recall bias (cases may remember exposures differently) and to selection bias in how controls are chosen. Neither design assigns exposure, so unmeasured confounding is always a concern — a recurring theme across the research lifecycle.

    STROBE reporting

    Both designs are reported against the STROBE guideline (Strengthening the Reporting of Observational Studies in Epidemiology), a checklist covering how participants were selected, how variables were measured, how bias was addressed and how results were analysed. Transparent reporting lets readers judge validity — the same transparency goal behind structured abstracts, covered in our guide to how to write a research abstract, and the IMRaD structure in the anatomy of a journal article.

    How design choice fits the research record

    Naming a design precisely is part of describing a study well. Controlled terminology in our dictionary and contributor roles via CRediT make that description machine-readable, while our guidance for authors helps them report methods clearly.

    Frequently asked questions

    Is a retrospective cohort the same as a case-control study?

    No. A retrospective cohort still groups by exposure and follows toward outcome, using past records; a case-control study groups by outcome and looks back at exposure. The starting point differs.

    Why can’t a case-control study report relative risk?

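    Because the researcher fixes the numbers of cases and controls, the proportion of people with the outcome in the sample does not reflect how common the outcome is in the population. Risks therefore cannot be estimated directly, so the odds ratio is used instead.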

    Which design is stronger?

    Neither universally. Cohort designs suit common outcomes and temporal questions; case-control designs suit rare outcomes and efficiency. The research question decides.

    What is STROBE for?

    It is a reporting checklist that improves the completeness and transparency of observational studies, helping readers assess potential bias and the strength of the evidence.

  • DISC and Personality Assessment as Measurement Science

    The DISC test is a self-report behavioural assessment that profiles people across four dimensions: Dominance, Influence, Steadiness and Conscientiousness. Widely used in workplace training and team-building, DISC traces to the work of psychologist William Moulton Marston in the 1920s. Like any personality instrument, its usefulness depends not on popularity but on how well it satisfies measurement-science criteria—reliability, validity and fitness for the purpose to which it is applied.

    Origins and the four dimensions

    Marston proposed a model of normal human emotions and behaviour built around two axes—how a person perceives their environment (favourable or antagonistic) and how active or passive they feel in it. Later authors operationalised these ideas into questionnaires that yield the familiar DISC profile.

    Dimension | Behavioural emphasis
    Dominance (D) | Directness, control, results focus
    Influence (I) | Sociability, persuasion, enthusiasm
    Steadiness (S) | Patience, cooperation, stability
    Conscientiousness (C) | Accuracy, structure, attention to detail

    Marston was a theorist of emotion, not of psychometric test construction; he did not design DISC as a rigorously validated assessment, and modern commercial versions vary in quality. It is also worth noting that the four labels describe behavioural styles—how a person tends to act in a given context—rather than fixed, deep-seated traits. Behaviour is partly situational, so a profile captured at one moment in one setting may not generalise to another, a point that should temper any strong claims drawn from a single administration.

    How such instruments are evaluated

    Any personality measure should be judged on the standard psychometric criteria. Reliability covers consistency: test-retest stability, internal consistency (often summarised by Cronbach’s alpha) and, where raters are involved, inter-rater agreement. Validity covers meaning: construct validity (does the test measure the trait it claims?), content validity (do the items sample the domain?) and criterion validity (does the score predict relevant outcomes?). A reputable instrument publishes these properties; a marketing brochure that omits them is a warning sign.
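    Internal consistency is one of the few criteria above that reduces to a short calculation. The sketch below computes Cronbach's alpha with the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), from a small invented response matrix. The data and the function name are hypothetical and do not come from any DISC product.

```python
import statistics


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # one tuple of scores per item
    item_variance_sum = sum(statistics.variance(item) for item in items)
    total_variance = statistics.variance([sum(person) for person in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Five hypothetical respondents answering three items on a 1-5 scale.
sample = [[3, 4, 3], [2, 2, 3], [5, 4, 4], [1, 2, 1], [4, 5, 5]]
alpha = cronbach_alpha(sample)  # about 0.94
```

    A high alpha shows only that the items move together. It says nothing about validity, which has to be established separately.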

    Norms, fairness and interpretation

    A frequently overlooked requirement of any assessment used with people is a defensible set of norms—the reference data against which an individual’s score is interpreted. A raw DISC profile means little without knowing the population it is compared against; a score that looks “high” relative to one norm group may be average against another. Responsible use therefore depends on the publisher documenting who the norm sample was, how large it was and when it was collected, and on practitioners checking that those norms are appropriate for the people being assessed. Where norms are outdated, unrepresentative or undisclosed, interpretations risk being unfair, particularly if results feed into decisions about individuals. These fairness considerations are part of why such tools are better confined to development than to selection.

    Strengths and limitations of DISC

    DISC’s appeal is its simplicity and a shared vocabulary for discussing communication styles. Used as a facilitation aid, it can prompt useful reflection and dialogue. Its limitations are also clear. The model focuses on observable behavioural style rather than the broad trait structure recovered in academic research, and independent peer-reviewed validation is thinner than for established inventories. Where rigorous prediction is required, researchers more often turn to dimensional models such as the Big Five, which have stronger published reliability and validity—a contrast also seen in critiques of the Myers-Briggs Type Indicator.

    Appropriate versus inappropriate uses

    The measurement-science verdict on DISC is best stated as a matter of fit:

    • Appropriate: stimulating self-awareness, opening conversations about working styles, and structuring team-building discussions where no high-stakes decision rides on the result.
    • Inappropriate: selecting or rejecting job candidates, denying promotions, or making any consequential judgement about a person, especially when the specific version’s predictive validity is undocumented.

    This distinction is central to responsible assessment: a tool can be valuable for development and yet wholly unsuitable for selection. Treating a developmental instrument as a gatekeeping test imports risks of unfairness and poor decisions.

    From Marston’s theory to commercial instruments

    It is worth separating the model from the products built on it. Marston set out his ideas in his 1928 book on the emotions of normal people, describing behavioural tendencies along his two axes. He did not, however, publish a validated assessment. The questionnaires sold today were developed later by various authors and publishers, who differ in their item construction, scoring and norming. As a result, “DISC” is not a single standardised test but a family of instruments of varying quality, and a positive evaluation of one product does not transfer to another that merely shares the name. A measurement-science evaluation must therefore target the specific version in use, asking for its technical manual, sample sizes, reliability coefficients and validity studies rather than accepting the brand at face value.

    Ipsative scoring and its consequences

    Many DISC instruments use a forced-choice, or ipsative, format in which respondents rank options against one another rather than rating each independently. Ipsative scoring has a known measurement drawback: because raising one score necessarily lowers others, the dimensions are not independent, which complicates comparisons between people and can distort the apparent profile. This is a technical reason that some DISC products are better suited to within-person reflection (“which of my tendencies is strongest?”) than to between-person ranking (“is candidate A more dominant than candidate B?”). Recognising the scoring model is part of judging whether a tool fits the intended use.
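    The loss of between-person information can be shown with a toy example. The sketch assumes each respondent has an underlying strength on every dimension, which a real forced-choice questionnaire never observes directly, and converts those strengths into ranks. The numbers and the function name are hypothetical.

```python
def forced_ranks(strengths):
    """Rank four preference strengths from 1 (least like me) to 4 (most like me)."""
    ordered = sorted(strengths, key=strengths.get)
    return {dimension: rank for rank, dimension in enumerate(ordered, start=1)}


# Two hypothetical respondents: A is stronger than B on every dimension.
person_a = {"D": 9, "I": 8, "S": 7, "C": 6}
person_b = {"D": 5, "I": 2, "S": 1, "C": 0}

profile_a = forced_ranks(person_a)  # {"C": 1, "S": 2, "I": 3, "D": 4}
profile_b = forced_ranks(person_b)  # identical ranks
```

    Both profiles come out the same and every profile sums to 10, so the ranks show which tendency is strongest within a person but cannot show that A is more dominant than B.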

    A checklist for evaluating any personality tool

    The questions that should be asked of DISC apply to every commercial assessment: Is there an independent, peer-reviewed evidence base, or only publisher materials? Are reliability and validity coefficients published and adequate? Is the scoring normative or ipsative, and does that suit the purpose? Is the instrument being used for development or for a high-stakes decision? Applying this checklist consistently is what separates responsible assessment from the uncritical adoption of whichever tool is most heavily marketed.

    Reporting and transparency

    When personality data informs research or organisational practice, the instrument, its version and its psychometric evidence should be reported plainly so others can judge the result. This transparency mirrors the wider push for clear, reusable terminology recorded in a controlled research dictionary, and it gives authors a defensible basis for the claims they make. Without it, scores carry an unearned air of precision.

    Frequently asked questions

    What does DISC stand for?

    DISC stands for Dominance, Influence, Steadiness and Conscientiousness—four behavioural dimensions used to describe a person’s typical working and communication style.

    Who created the DISC model?

    The underlying theory comes from psychologist William Moulton Marston in the 1920s. Later authors and commercial publishers turned his ideas into the questionnaires now marketed as DISC assessments.

    Is DISC scientifically valid?

    It depends on the specific version. DISC is useful as a development and communication aid, but independent validity evidence is variable, so it should not be used for hiring or other high-stakes decisions.

    How should organisations use DISC responsibly?

    Limit it to self-awareness and team discussion, avoid using it to screen candidates, and report the instrument and its psychometric properties so results can be judged on their evidence.

  • P-Values and Statistical Significance Explained Correctly

    A p-value is the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis is true. It is a measure of how compatible the data are with a specified statistical model in which there is no effect or no difference. A small p-value indicates that the observed data would be unusual if the null hypothesis held; it does not, by itself, prove that the null hypothesis is false or that an effect is real or important.

    What the null hypothesis represents

    Hypothesis testing begins with a null hypothesis, typically a statement of no effect, no difference or no association. The test asks how surprising the observed data would be if that null hypothesis were true. The p-value quantifies that surprise: the smaller it is, the less compatible the data are with the null model. Critically, the p-value is calculated under the assumption that the null is true, which is why it cannot be read as the probability that the null is true.
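    A small exact calculation makes the definition concrete. The example is an assumption chosen for simplicity and is not taken from the source: the null hypothesis is that a coin is fair, and the observed result is 16 heads in 20 flips. The function name is hypothetical.

```python
from math import comb


def fair_coin_p_value(heads, flips):
    """Two-sided exact p-value under the null hypothesis of a fair coin."""
    expected = flips / 2
    observed_distance = abs(heads - expected)
    # Count every outcome at least as far from the expected value as the observed one.
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - expected) >= observed_distance)
    return extreme / 2 ** flips


p = fair_coin_p_value(16, 20)  # about 0.0118
```

    The result means that a fair coin would give 16 or more heads, or 4 or fewer, in about 1.2% of such experiments. It is not the probability that the coin is fair.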

    The American Statistical Association’s 2016 statement

    In 2016 the American Statistical Association (ASA) published a formal statement on p-values, the first time it had issued such guidance, in response to widespread misuse. The statement set out six principles. In summary, it affirmed that p-values can indicate how incompatible data are with a specified model, but warned that a p-value does not measure the probability that the hypothesis under study is true, nor the probability that the data arose by chance alone. It cautioned that scientific conclusions should not be based only on whether a p-value passes a threshold, that proper inference requires full reporting and transparency, that a p-value does not measure the size or importance of an effect, and that by itself a p-value is a poor measure of evidence regarding a model or hypothesis.

    Common misinterpretations

    Several persistent errors surround p-values. Avoiding them is essential for sound, reproducible reporting.

    Misinterpretation | Why it is wrong
    The p-value is the probability the null hypothesis is true | It is calculated assuming the null is true; it cannot also be that probability
    p = 0.05 means a 5% chance the result is a fluke | The p-value is not the probability that the finding is due to chance
    A non-significant result proves no effect exists | Absence of significance is not evidence of absence; the study may simply lack power
    A small p-value means a large or important effect | The p-value depends on sample size as well as effect magnitude, so a tiny effect can yield a small p-value in a large sample

    The limits of the 0.05 convention

    The threshold of 0.05 for declaring statistical significance is a convention, not a law of nature. Treating 0.05 as a bright line encourages dichotomous thinking in which a result at p = 0.049 is celebrated and one at p = 0.051 dismissed, despite negligible difference between them. The convention has fed practices such as selective reporting and p-hacking (adjusting analyses until a result crosses the threshold), both of which are serious threats to reproducibility. The ASA statement explicitly warned against basing conclusions solely on whether a p-value clears a cut-off.
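    One way to see why repeated analysis undermines the threshold is to compute how often at least one test falls below it by chance. The sketch assumes the tests are independent and that every null hypothesis is true. Real p-hacking usually involves correlated analyses, so this is an idealised illustration, and the function name is hypothetical.

```python
def familywise_error_rate(tests, alpha=0.05):
    """Chance of at least one p < alpha across independent tests when every null is true."""
    return 1 - (1 - alpha) ** tests


one_test = familywise_error_rate(1)       # 0.05
twenty_tests = familywise_error_rate(20)  # about 0.64
```

    Under these assumptions, trying twenty analyses gives roughly a 64% chance that at least one of them is "significant" even when there is nothing to find.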

    Effect sizes and intervals

    Because a p-value says nothing about magnitude, it should be accompanied by an effect size, which describes how large the observed effect is, and ideally a confidence interval, which expresses the precision of the estimate. Reporting these alongside, or instead of, a bare p-value gives readers far more information for judging whether a finding matters. The underpinning ideas come from the wider discipline of statistics, and transparent reporting of all of them supports the goals tracked in our reproducibility category. For terminology and reporting conventions, consult the CASRAI dictionary.
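    As an illustration of what such reporting involves, the sketch below computes one common effect size, Cohen's d, and an approximate 95% confidence interval for a difference in means. The data are invented and the function names are hypothetical. The interval uses a normal critical value for simplicity, an assumption that is only reasonable for larger samples; small samples would normally use a t-distribution critical value.

```python
import statistics
from math import sqrt


def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / sqrt(pooled_var)


def mean_difference_ci(a, b, level=0.95):
    """Approximate confidence interval for mean(a) - mean(b), normal critical value."""
    diff = statistics.mean(a) - statistics.mean(b)
    se = sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se


treated = [5, 6, 7, 8, 9]
control = [3, 4, 5, 6, 7]
d = cohens_d(treated, control)                   # about 1.26
low, high = mean_difference_ci(treated, control)  # about (0.04, 3.96)
```

    The effect size says how large the difference is in standard-deviation units, and the width of the interval shows how imprecisely five observations per group pin it down.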

    Frequently asked questions

    Does a p-value below 0.05 prove an effect is real?

    No. It indicates the data would be unusual if the null hypothesis were true, but it does not prove the null is false, nor that the effect is large or important. Replication, effect sizes and intervals are needed to judge that.

    What did the ASA 2016 statement conclude?

    The statement set out six principles emphasising that p-values measure compatibility with a model, are not the probability the hypothesis is true, do not measure effect size, and should never be the sole basis for scientific conclusions. It urged full transparency in reporting.

    Should we abandon p-values altogether?

    Not necessarily. P-values can be informative when interpreted correctly and reported alongside effect sizes and confidence intervals. The problem lies in misuse and over-reliance on a single threshold, not in the statistic itself. See the CASRAI author guidance for reporting practices.

  • T-Tests Explained: Comparing Two Means

    A t-test is a statistical test that assesses whether the difference between two means is larger than would be expected by chance alone. It compares the size of an observed difference against the variability in the data, producing a t-statistic that can be converted into a p-value. The t-test is one of the most common tools for comparing groups in research.

    How a t-test works

    The t-statistic is essentially the difference between means divided by the standard error of that difference. A large t-statistic indicates that the difference is large relative to the sampling variability expected in it, so the data are less compatible with the null hypothesis of no difference. The t-statistic is then evaluated against the t-distribution, which resembles the normal distribution but has heavier tails to account for the extra uncertainty in small samples.

    The three types of t-test

    There are three principal forms of the t-test, each suited to a particular comparison.

    Type | What it compares | Typical use
    One-sample | A sample mean against a known or hypothesised value | Testing whether a mean differs from a reference standard
    Independent-samples | The means of two separate, unrelated groups | Comparing a treatment group with a control group
    Paired | Two measurements from the same subjects | Before-and-after measurements on the same participants

    Choosing the right type is essential. Using an independent-samples test on paired data, for instance, ignores the correlation between the two measurements and usually reduces the power of the analysis.
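
    The loss of power from ignoring pairing can be illustrated with a small sketch. The before-and-after scores below are invented, and the helper names `paired_t` and `independent_t` are hypothetical; the point is only that the same numbers give very different t-statistics depending on which test is applied.

```python
import math
import statistics


def paired_t(before, after):
    """Paired t: a one-sample t-statistic on the per-subject differences."""
    diffs = [a - b for a, b in zip(after, before)]
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_error


def independent_t(x, y):
    """Unpooled two-sample t-statistic, which ignores the pairing."""
    std_error = math.sqrt(statistics.variance(x) / len(x)
                          + statistics.variance(y) / len(y))
    return (statistics.mean(y) - statistics.mean(x)) / std_error


# Invented scores: subjects differ a lot from each other,
# but each one improves by a similar small amount.
before = [72.0, 85.0, 64.0, 90.0, 78.0, 69.0]
after = [74.5, 87.0, 66.5, 93.0, 80.0, 71.0]

t_paired = paired_t(before, after)
t_indep = independent_t(before, after)
print(f"paired t = {t_paired:.1f}, independent t = {t_indep:.2f}")
```

    Because the paired test works on within-subject differences, the large between-subject spread drops out of the standard error, which is why its t-statistic is so much larger here.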

    Assumptions of the t-test

    The validity of a t-test rests on several assumptions. The data should be approximately normally distributed, particularly in small samples, although the central limit theorem makes the test fairly robust at larger sample sizes. Observations should be independent, except in the paired test where the pairing is deliberate. For the independent-samples test, the two groups are traditionally assumed to have equal variances; when this assumption is doubtful, Welch’s t-test, which does not require equal variances, is a safer default. Outliers can distort the result and should be inspected beforehand.
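
    For readers who want to see what Welch's version changes, here is a short sketch that computes the Welch t-statistic and the Welch-Satterthwaite approximation to the degrees of freedom. The function name `welch_t` and the two data sets are illustrative assumptions, chosen so that the groups have clearly unequal spread.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t_stat = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, df


group_a = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]        # tight spread
group_b = [8.0, 12.5, 9.1, 11.7, 7.4, 10.9, 13.2]   # wide spread
t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

    The resulting degrees of freedom are generally not a whole number and never exceed the pooled value of n1 + n2 - 2; the more unequal the variances, the further they fall below it.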

    Relationship to p-values and significance

    The t-test does not by itself prove that two groups differ; it quantifies the evidence against the null hypothesis that the means are equal. The resulting p-value is the probability of observing a difference at least as large as the one found, assuming the null hypothesis is true. A small p-value, conventionally below 0.05, suggests the difference is statistically significant, but it says nothing about the size or practical importance of the effect. Reporting the mean difference and a confidence interval alongside the p-value gives a fuller picture.

    Reporting t-tests transparently

    Good practice is to report the type of t-test used, the t-statistic, the degrees of freedom, the p-value, the effect size and a confidence interval. Stating which test was chosen and why, and confirming that its assumptions were checked, supports the reproducibility goals described in the CASRAI dictionary and our guidance for authors. An adequately powered design, discussed in our piece on statistical power, is equally important.

    Frequently asked questions

    When should I use a t-test rather than ANOVA?

    Use a t-test to compare two means. When you need to compare three or more group means simultaneously, analysis of variance (ANOVA) is the appropriate extension, as running multiple t-tests inflates the chance of a false positive.
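
    The inflation can be made concrete with a rough calculation. The sketch below assumes the pairwise tests are independent, which is a simplification (tests that share a group are correlated), so the figures are indicative only; `familywise_error_rate` is a hypothetical helper name.

```python
from math import comb


def familywise_error_rate(n_groups, alpha=0.05):
    """Chance of at least one false positive across all pairwise t-tests,
    assuming independent tests (a simplification)."""
    n_tests = comb(n_groups, 2)
    return n_tests, 1 - (1 - alpha) ** n_tests


for k in (2, 3, 4, 5):
    n_tests, fwer = familywise_error_rate(k)
    print(f"{k} groups: {n_tests} pairwise tests, familywise rate ~ {fwer:.3f}")
```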

    What if my data are not normally distributed?

    For small, clearly non-normal samples, consider a non-parametric alternative such as the Mann-Whitney U test for independent groups or the Wilcoxon signed-rank test for paired data.

    What is the difference between a one-tailed and two-tailed t-test?

    A two-tailed test detects a difference in either direction and is the default. A one-tailed test only looks for a difference in one specified direction and should be used only when justified in advance.

  • Sample Size and Statistical Power Explained

    Statistical power is the probability that a study will correctly detect an effect when one truly exists. It is formally defined as one minus the Type II error rate, written as power = 1 − β. A study with high power is likely to find a real effect; an underpowered study may miss it, producing a false negative. Power is closely tied to sample size, which is why power analysis is a core part of study planning.

    Type I and Type II errors

    Hypothesis testing can go wrong in two ways. A Type I error, with probability α, occurs when the test detects an effect that is not really there, a false positive. A Type II error, with probability β, occurs when the test fails to detect an effect that is genuinely present, a false negative.

    Test result | Effect truly exists | No effect exists
    Test is significant | Correct (power = 1 − β) | Type I error (α)
    Test is not significant | Type II error (β) | Correct

    The significance threshold α is usually set at 0.05, which links directly to the interpretation of p-values and significance testing.

    The 0.8 convention

    By widespread convention, researchers aim for a power of at least 0.8, meaning the study has an 80% chance of detecting the effect of interest if it exists. This corresponds to a Type II error rate of 0.2. The figure is a pragmatic standard rather than a law: some fields demand higher power, such as 0.9, particularly when missing an effect would be costly. The key point is to choose and justify a target before data collection.

    What determines power?

    Four quantities are linked: the sample size, the effect size, the significance level α and the power. Fixing any three determines the fourth. Power increases with a larger sample size, a larger true effect, a less stringent α and lower data variance. Because researchers usually cannot change the effect size or the desired α, the practical lever is the sample size.
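
    The link between the four quantities can be sketched with a normal approximation for a two-sided comparison of two equal-sized groups. This is an approximation rather than the exact t-based calculation, `approx_power` is a hypothetical function name, and the effect size is assumed to be a standardised mean difference (Cohen's d).

```python
from statistics import NormalDist


def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    Normal approximation; ignores the negligible chance of rejecting
    in the wrong direction. effect_size is Cohen's d.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    noncentrality = effect_size * (n_per_group / 2) ** 0.5
    return z.cdf(noncentrality - z_crit)


for n in (20, 64, 100):
    print(f"n = {n:3d} per group, d = 0.5: power ~ {approx_power(n, 0.5):.2f}")
```

    Holding the effect size and alpha fixed and varying only `n_per_group` shows why sample size is the practical lever.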

    A priori power analysis

    An a priori power analysis is performed before data collection to determine the sample size needed to achieve the desired power for a plausible effect size. Researchers specify the target power (often 0.8), the significance level (often 0.05) and the smallest effect size they consider meaningful, then calculate the required number of participants. This prevents the common mistake of recruiting too few subjects, and is increasingly expected by funders, ethics committees and journals. The same logic applies whether the planned analysis is a t-test, a regression or another test.
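
    A minimal sketch of such a calculation, again using a normal approximation for a two-sided comparison of two equal groups, is shown below. The function name `n_per_group` is hypothetical, and exact t-based software gives slightly larger numbers, so this should be read as a rough planning aid rather than a replacement for a dedicated power tool.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, power=0.8, alpha=0.05):
    """Approximate participants needed per group (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

    Halving the smallest effect size of interest roughly quadruples the required sample, which is why the choice of effect size deserves explicit justification.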

    Why underpowered studies harm reproducibility

    Underpowered studies are a major threat to reproducibility. They frequently miss real effects, and when they do reach significance the estimated effect is often exaggerated, a phenomenon known as the winner’s curse. Such inflated estimates fail to replicate in larger studies. Conducting and reporting a power analysis, and pre-specifying the sample size, makes research more credible. The CASRAI dictionary and our author guidance encourage transparent reporting of these design choices, ideally alongside a confidence interval that conveys the precision of the estimate.
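
    The winner's curse can be reproduced in a small simulation. The sketch below assumes a true standardised effect of 0.2, twenty participants per group, a known standard deviation of 1 (so a z-test is used for simplicity) and a fixed random seed; all of these settings, and the function name, are illustrative choices.

```python
import random
import statistics
from statistics import NormalDist


def simulate_winners_curse(true_effect=0.2, n_per_group=20,
                           n_studies=5000, alpha=0.05, seed=1):
    """Return (share of significant studies, mean estimate among them)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_error = (2 / n_per_group) ** 0.5  # SE of a mean difference, sigma = 1
    significant = []
    for _ in range(n_studies):
        treat = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
        ctrl = [rng.gauss(0, 1) for _ in range(n_per_group)]
        estimate = statistics.mean(treat) - statistics.mean(ctrl)
        if estimate / std_error > z_crit:  # significant in the expected direction
            significant.append(estimate)
    return len(significant) / n_studies, statistics.mean(significant)


power, mean_significant = simulate_winners_curse()
print(f"share significant: {power:.2f}")
print(f"true effect 0.20, mean significant estimate: {mean_significant:.2f}")
```

    With so little power, an estimate has to be several times larger than the true effect just to cross the significance threshold, so the studies that "succeed" are exactly the ones that overestimate.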

    Frequently asked questions

    What is a good level of statistical power?

    A power of 0.8 is the common minimum, giving an 80% chance of detecting a true effect. Higher targets such as 0.9 are preferable when feasible, especially for confirmatory studies.

    Can I calculate power after the study is finished?

    Post-hoc power calculated from the observed effect is generally uninformative, because for a given test it is a direct function of the p-value and so adds no new information. Power analysis is most useful when done in advance to plan sample size.

    What is the relationship between sample size and power?

    Larger samples increase power because they reduce the standard error, making real effects easier to detect. This is the main reason a priori power analysis focuses on choosing an adequate sample size.

  • What Is Generative AI, and What Are the Research Disclosure Norms?

    Generative AI refers to machine-learning systems that produce new content, such as text, images, audio or code, by modelling the patterns of their training data and sampling from them. Unlike predictive models that output a label or a number, a generative model outputs an artefact. The most prominent examples are large language models (LLMs) for text and diffusion models for images. For research, the rise of these tools has prompted clear disclosure norms from editorial bodies, the most important being that AI cannot be listed as an author.

    What generative AI is

    Modern generative systems are typically foundation models: large models trained on broad data at scale, then adapted to many downstream tasks. Large language models are built on the transformer architecture introduced in 2017, which uses an attention mechanism to weigh relationships between tokens in a sequence and predict the next token. Diffusion models generate images by learning to reverse a gradual noising process, starting from random noise and denoising it step by step into a coherent image. The underlying machinery is the neural network described in our explainer on neural networks and deep learning.

    How generative AI differs from predictive ML

    The distinction is one of output. Predictive (discriminative) machine learning answers questions about given inputs: is this email spam, what is this house worth, which category does this image belong to? Generative AI instead produces novel outputs that did not exist before. A useful framing is that predictive models estimate a label given an input, whereas generative models estimate the distribution of the data itself and sample new examples from it. The foundations of the predictive paradigm are covered in our guide to machine learning concepts and methods.

    Aspect | Predictive ML | Generative AI
    Typical output | Label, score or value | New text, image, audio or code
    Goal | Predict a target for an input | Produce novel content
    Examples | Spam filter, price regression | LLMs, diffusion image models
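
    A deliberately tiny sketch can make the contrast tangible. The message-length data are invented, the nearest-mean classifier stands in for a predictive model, and a fitted normal distribution stands in for a generative one; real systems are vastly more complex, so this only illustrates the difference in what each kind of model returns.

```python
from statistics import NormalDist

# Hypothetical training data: message lengths (in words) for two classes.
spam = [12, 15, 9, 14, 11, 13]
ham = [48, 55, 60, 43, 52, 57]


def predict(length):
    """Predictive: map an input to a label (nearest class mean)."""
    spam_mean = sum(spam) / len(spam)
    ham_mean = sum(ham) / len(ham)
    return "spam" if abs(length - spam_mean) < abs(length - ham_mean) else "ham"


# Generative: estimate the distribution of the data, then sample new examples.
spam_model = NormalDist.from_samples(spam)
new_spam_lengths = spam_model.samples(3, seed=0)

print(predict(10), predict(50))                 # labels for given inputs
print([round(v, 1) for v in new_spam_lengths])  # newly generated values
```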

    Emerging research-disclosure norms

    As researchers began using generative tools to draft, edit and analyse, journals and editorial bodies responded with guidance. Two positions are now widely shared across the scholarly publishing ecosystem.

    AI cannot be an author. The International Committee of Medical Journal Editors (ICMJE) and the Committee on Publication Ethics (COPE) hold that authorship entails responsibility and accountability that a non-human tool cannot bear, including approving the final version and being answerable for the integrity of the work. A generative model therefore cannot meet authorship criteria and must not be listed as an author or co-author.

    Use must be disclosed. Where generative AI has been used in producing a manuscript, authors are expected to disclose how it was used, typically in the methods or acknowledgements, so that reviewers and readers can assess it. Authors remain fully responsible for the accuracy and integrity of everything in the submission, including any AI-assisted content. These norms are tracked across our GenAI disclosure coverage, and they extend to confidential contexts such as peer review, as set out in our policy on generative AI in peer review, disclosure and confidentiality.

    Documenting generative-AI use in the research record

    Good disclosure is specific. Stating which tool was used, for what purpose (for example language editing versus drafting analysis), and what human verification followed, makes the record auditable. This dovetails with structured documentation practices such as model cards and datasheets, discussed in our piece on AI model documentation, and with the controlled vocabulary maintained in the casrai.org research dictionary.
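
    One way to keep such a disclosure specific is to capture it as a small structured record. The sketch below is purely illustrative: the class name, the field names and the placeholder tool name are assumptions, not a schema published by ICMJE, COPE or CASRAI.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AIUseDisclosure:
    """Hypothetical record of generative-AI use in a manuscript."""
    tool: str
    version: str
    purpose: str
    sections_affected: list
    human_verification: str


record = AIUseDisclosure(
    tool="ExampleLLM",        # placeholder, not a real product
    version="unspecified",
    purpose="language editing",
    sections_affected=["Introduction", "Discussion"],
    human_verification="All edited text checked by the authors against the source data",
)
print(json.dumps(asdict(record), indent=2))
```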

    Frequently asked questions

    Can generative AI be listed as an author on a paper?

    No. ICMJE and COPE positions hold that authorship requires accountability for the work that a non-human tool cannot bear. Generative AI cannot be an author or co-author, and its use should instead be disclosed.

    How is generative AI different from predictive machine learning?

    Predictive ML outputs a label, score or value for a given input, while generative AI produces new content such as text or images. Generative models learn the distribution of the data and sample from it.

    Where should authors disclose generative-AI use?

    Typically in the methods or acknowledgements, stating which tool was used and for what purpose. Authors remain fully responsible for the accuracy and integrity of all AI-assisted content.

    What is a foundation model?

    A foundation model is a large model trained on broad data at scale and then adapted to many downstream tasks. Large language models and diffusion image models are common examples.

  • Time-Domain Spectroscopy: Principles and Applications

    Time-domain spectroscopy, most commonly in the terahertz range, is a measurement technique that records the full electric-field waveform of an ultrashort light pulse as a function of time, rather than measuring intensity at each frequency separately. Because the technique captures the field directly, a single Fourier transform converts the time-domain trace into both an amplitude spectrum and a phase spectrum. This article explains how the measurement is produced and where it is used in materials research. It is a methods explainer about instrumentation and signal analysis.

    What makes it a time-domain technique

    Conventional spectroscopy measures how much light of each frequency a sample transmits or absorbs, producing a spectrum directly. Time-domain spectroscopy works differently. It launches a single, extremely short burst of electromagnetic radiation, a pulse lasting on the order of picoseconds or less, and records the shape of that pulse, its electric field rising and falling, as it evolves in time. The measured object is a waveform, a plot of field strength against time, not a spectrum. The spectrum is obtained afterwards by computation.

    The key enabling fact is that the pulse is so short that it contains a broad band of frequencies simultaneously. A pulse confined to a tiny window in time is necessarily spread across a wide range in frequency, a direct consequence of the Fourier relationship between time and frequency. Recording one waveform therefore samples the whole band at once.

    Generating and detecting an ultrashort pulse

    The pulses are produced from a femtosecond laser whose ultrashort optical pulse drives a photoconductive emitter or a nonlinear crystal, converting the optical pulse into a terahertz pulse. Detection is the elegant part. To measure a field that oscillates far too fast for any conventional detector to follow, the system uses a sampling scheme. A portion of the same femtosecond laser pulse is split off as a gate, and a variable optical delay line changes the path length, and hence the arrival time, of this gate by tiny, precise increments. At each delay setting the detector reads the terahertz field at that instant. Stepping the delay across the pulse traces out the entire waveform point by point, much as a sampling oscilloscope reconstructs a fast repeating signal.
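
    The sampling scheme can be mimicked in a few lines. In the sketch below the terahertz pulse is a synthetic function rather than a measurement, the delay stage is assumed to be a retroreflector (so moving it by x changes the optical path by 2x), and the function names and step size are illustrative.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def thz_field(t_ps):
    """Toy terahertz pulse: Gaussian envelope times a 1 THz carrier."""
    return math.exp(-(t_ps / 0.5) ** 2) * math.cos(2 * math.pi * 1.0 * t_ps)


def sample_waveform(start_ps, stop_ps, stage_step_um):
    """Step a simulated delay stage and read one field value per position."""
    # Assumed retroreflector geometry: a stage move of x adds 2x of path,
    # so the gate pulse arrives 2x/c later.
    dt_ps = 2 * stage_step_um * 1e-6 / C * 1e12
    n_points = int((stop_ps - start_ps) / dt_ps) + 1
    times = [start_ps + i * dt_ps for i in range(n_points)]
    return times, [thz_field(t) for t in times]


times, field = sample_waveform(-3.0, 3.0, stage_step_um=5.0)
print(f"{len(times)} points, time step {times[1] - times[0]:.4f} ps")
```

    A five-micrometre stage step corresponds to a delay increment of roughly 33 femtoseconds, which shows how mechanical positioning translates into sub-picosecond time resolution.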

    From waveform to spectrum: the Fourier transform

    Once the time-domain waveform is recorded, a Fourier transform decomposes it into its constituent frequencies. Crucially, because the technique measures the field rather than the intensity, the transform yields complex values, giving both the amplitude and the phase at every frequency. This is a notable advantage over intensity-only methods, where phase information is lost and must be inferred.

    Domain | What is measured or derived | How it is obtained
    Time domain | Electric-field waveform versus time | Delay-line sampling of the pulse
    Frequency domain | Amplitude spectrum | Magnitude of the Fourier transform
    Frequency domain | Phase spectrum | Phase of the Fourier transform
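
    The step from waveform to amplitude and phase spectra can be sketched with a naive discrete Fourier transform. The waveform here is synthetic, the sampling interval is an assumed value, and a real analysis would normally use an FFT library; the slow transform is written out only to keep the example self-contained.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform (O(N^2)), non-negative frequencies only."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(samples))
            for k in range(n // 2 + 1)]


dt_ps = 0.05  # assumed sampling interval in picoseconds
times = [(i - 64) * dt_ps for i in range(128)]
# Synthetic pulse centred near 1 THz.
waveform = [math.exp(-(t / 0.5) ** 2) * math.cos(2 * math.pi * 1.0 * t)
            for t in times]

spectrum = dft(waveform)  # complex values: one per frequency bin
freqs_thz = [k / (len(waveform) * dt_ps) for k in range(len(spectrum))]
amplitude = [abs(c) for c in spectrum]
phase = [cmath.phase(c) for c in spectrum]

peak_freq = freqs_thz[amplitude.index(max(amplitude))]
print(f"amplitude peaks near {peak_freq:.2f} THz")
```

    Each complex bin carries both quantities in the table: its magnitude is the amplitude spectrum and its argument is the phase spectrum.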

    To characterise a sample, the experimenter records two waveforms: a reference with no sample in the beam, and one with the sample present. Comparing the two transforms gives the frequency-dependent change in amplitude and phase, from which optical constants such as the refractive index and absorption coefficient are computed. This dual recording and transform shares its logic with the Fourier reconstruction used in MRI.
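
    The reference-versus-sample comparison can be sketched for the simplest case. The example below assumes a thick, non-dispersive slab with no internal reflections, models the sample only as a time delay plus an arbitrary amplitude factor, and picks values so that the phase shift stays below π and needs no unwrapping. All names and numbers are illustrative; a real extraction would also account for Fresnel losses and work across the whole band.

```python
import cmath
import math

C = 299_792_458.0  # speed of light in m/s


def field_at_frequency(times_s, samples, freq_hz):
    """Single-frequency Fourier component of a sampled waveform."""
    return sum(x * cmath.exp(-2j * math.pi * freq_hz * t)
               for t, x in zip(times_s, samples))


def pulse(t_s, delay_s=0.0, scale=1.0):
    """Synthetic terahertz pulse; delay and scale mimic passage through a slab."""
    shifted = t_s - delay_s
    return (scale * math.exp(-(shifted / 0.5e-12) ** 2)
            * math.cos(2 * math.pi * 0.3e12 * shifted))


thickness_m = 0.5e-3
true_index = 1.5
delay_s = (true_index - 1) * thickness_m / C  # extra transit time in the slab

times = [i * 0.05e-12 for i in range(-100, 200)]
reference = [pulse(t) for t in times]
with_sample = [pulse(t, delay_s, scale=0.8) for t in times]

freq = 0.3e12
transfer = (field_at_frequency(times, with_sample, freq)
            / field_at_frequency(times, reference, freq))
# With the exp(-i*2*pi*f*t) convention, a delay shows up as a negative phase.
index = 1 - C * cmath.phase(transfer) / (2 * math.pi * freq * thickness_m)
print(f"recovered refractive index ~ {index:.3f}, amplitude ratio ~ {abs(transfer):.2f}")
```

    The phase of the ratio yields the refractive index, while its magnitude is the starting point for the absorption coefficient once reflection losses are accounted for.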

    Research applications in materials

    Because the terahertz band sits between microwaves and infrared, it probes low-energy excitations that other techniques miss: lattice vibrations in crystals, the dynamics of charge carriers in semiconductors, and weak intermolecular modes in molecular solids. Researchers use the technique to measure the conductivity of thin films without contacts, to identify crystalline forms of a compound by their characteristic absorption features, and to study the dielectric response of polymers and composites. The direct access to phase makes it well suited to measuring thickness and refractive index of layered materials.

    As with any quantitative technique, results depend on careful calibration and reporting of instrument settings, the subject of our guide on reporting analytical methods reproducibly. Standard terminology is held in the CASRAI dictionary, and the wider context appears in our research lifecycle coverage.

    Frequently asked questions

    Why record a waveform instead of a spectrum directly?

    Measuring the field as a function of time preserves phase information that intensity-only spectroscopy discards. A single Fourier transform then yields both amplitude and phase, which together allow refractive index and absorption to be calculated directly, without the Kramers-Kronig analysis that intensity-only measurements typically require.

    How can a slow detector capture a picosecond pulse?

    It does not capture the pulse in one shot. Instead the system samples the field at a sequence of precisely controlled delay times set by an optical delay line, reading one point per delay across many repetitions of the pulse. Assembling these points reconstructs the fast waveform, the same principle as a sampling oscilloscope.

    What does the phase spectrum add?

    The phase encodes the time delay each frequency experiences passing through the sample, which relates directly to refractive index and sample thickness. Having phase alongside amplitude lets researchers extract optical constants unambiguously, a benefit not available to many conventional methods.

    Why the terahertz range specifically?

    The terahertz band coincides with the energies of many lattice vibrations, carrier dynamics and intermolecular modes, making it informative for materials research. The reproducibility considerations for such measurements are discussed in our reproducibility coverage and the author guidance.

  • Neural Networks and Deep Learning Explained

    An artificial neural network is a machine-learning model composed of many simple interconnected units, loosely inspired by biological neurons, that transform input data through successive layers of weighted connections. Deep learning is the use of neural networks with many such layers to learn rich, hierarchical representations directly from data. Together they underpin most of the recent advances in artificial intelligence, from image recognition to the large language models behind generative systems.

    Neurons, weights and activations

    The basic unit, often called a neuron or node, computes a weighted sum of its inputs, adds a bias term, and passes the result through a non-linear activation function. The weights are the model’s learnable parameters; they determine how strongly each input influences the unit’s output. The activation function, such as the rectified linear unit (ReLU) or the sigmoid, introduces non-linearity, which is essential: without it, stacking layers would collapse into a single linear transformation incapable of modelling complex patterns.
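
    The computation performed by a single unit is short enough to write out in full. The inputs, weights and bias below are arbitrary illustrative numbers, and the function names are chosen for this sketch.

```python
import math


def relu(z):
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return max(0.0, z)


def sigmoid(z):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


x = [0.5, -1.0, 2.0]   # illustrative inputs
w = [0.4, 0.3, -0.2]   # illustrative weights
b = 0.1

print(neuron(x, w, b, relu))
print(round(neuron(x, w, b, sigmoid), 4))
```

    The weighted sum here is negative, so ReLU outputs zero while the sigmoid returns a value below 0.5; the same pre-activation can therefore yield quite different outputs depending on the activation chosen.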

    Neurons are organised into layers: an input layer that receives the data, one or more hidden layers that transform it, and an output layer that produces the prediction. Information flows forward through these layers in a process called the forward pass. This architecture is one realisation of the machine-learning ideas described in our explainer on machine learning concepts and methods.

    What “deep” means

    The word deep refers simply to the number of layers. A network with many hidden layers is “deep”, and depth allows the model to build representations in stages: early layers may detect simple features such as edges in an image, while later layers combine these into increasingly abstract concepts such as shapes and objects. This automatic, layered feature learning is what distinguishes deep learning from earlier methods that relied on hand-engineered features. The historical shift to deep networks is traced in our overview of artificial intelligence definition and history.

    Component | Role
    Neuron (node) | Computes a weighted sum plus bias, then an activation
    Weight | Learnable parameter scaling each input
    Activation function | Adds non-linearity (e.g. ReLU, sigmoid)
    Layer | A group of neurons; depth is the number of layers
    Loss function | Measures error between prediction and target

    Training: backpropagation and gradient descent

    A neural network learns by adjusting its weights to reduce a loss function that measures how wrong its predictions are. Training proceeds in two coupled steps. First, the forward pass produces predictions and computes the loss. Second, backpropagation uses the chain rule of calculus to compute the gradient of the loss with respect to every weight, efficiently propagating error signals backward from the output layer to the input layer.
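
    The two steps can be traced by hand on the smallest possible network: one input, one hidden unit with a tanh activation and one linear output, trained with a squared-error loss. The parameter values are arbitrary and the function names are chosen for this sketch; a finite-difference check is included to confirm that the chain-rule gradient is right.

```python
import math


def forward(x, w1, b1, w2, b2):
    """Forward pass: hidden activation, then linear output."""
    h = math.tanh(w1 * x + b1)
    y = w2 * h + b2
    return h, y


def loss(y, target):
    return 0.5 * (y - target) ** 2


def backward(x, target, w1, b1, w2, b2):
    """Gradients of the loss for every parameter, via the chain rule."""
    h, y = forward(x, w1, b1, w2, b2)
    dy = y - target            # dL/dy
    dw2, db2 = dy * h, dy      # output layer
    dh = dy * w2               # error passed back to the hidden unit
    dz = dh * (1 - h ** 2)     # through tanh: derivative is 1 - tanh(z)^2
    dw1, db1 = dz * x, dz      # hidden layer
    return dw1, db1, dw2, db2


params = dict(w1=0.5, b1=-0.1, w2=1.2, b2=0.3)
x, target = 0.8, 1.0
grads = backward(x, target, **params)

# Finite-difference check on w1: nudge it and see how the loss changes.
eps = 1e-6
bumped = dict(params, w1=params["w1"] + eps)
numeric_dw1 = (loss(forward(x, **bumped)[1], target)
               - loss(forward(x, **params)[1], target)) / eps
print(f"backprop dL/dw1 = {grads[0]:.6f}, numerical = {numeric_dw1:.6f}")
```

    Deep-learning frameworks automate exactly this bookkeeping, reusing intermediate values from the forward pass so that all gradients cost about as much as one extra pass through the network.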

    These gradients tell an optimiser how to change each weight to reduce the loss. Gradient descent, usually in its stochastic mini-batch form, then nudges the weights a small step in the direction that lowers the loss, controlled by a learning rate. Repeating this over many passes through the data (epochs) gradually improves the model. Because the outcome depends on random initialisation, data ordering and these hyperparameters, careful reporting is essential, as discussed in our guide to reproducibility of machine learning research.
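
    A minimal sketch of mini-batch stochastic gradient descent is given below, fitting a straight line rather than a neural network so that the update rule is easy to follow. The data are generated without noise from y = 2x + 1, and the learning rate, batch size, epoch count and seed are arbitrary illustrative choices.

```python
import random

rng = random.Random(0)  # fixed seed so the data ordering is repeatable
data = [(k / 10, 2.0 * (k / 10) + 1.0) for k in range(-20, 21)]  # y = 2x + 1

w, b = 0.0, 0.0         # parameters to learn
learning_rate = 0.1
batch_size = 8

for epoch in range(200):
    rng.shuffle(data)   # new mini-batch composition every epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradients of the mean squared error over this mini-batch.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(f"learned w = {w:.3f}, b = {b:.3f}")
```

    Changing the seed alters the order in which batches are seen; on noisy real data that alone can shift the final weights, which is one reason seeds and hyperparameters belong in the reported methods.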

    Why documentation matters for neural networks

    Because a trained network is defined by millions of learned weights rather than human-readable rules, transparency depends on documentation: what data trained it, how it was evaluated, and what its limits are. Structured artefacts such as model cards, covered in our piece on AI model documentation, address exactly this need, and the controlled terminology in the casrai.org research dictionary helps keep descriptions consistent across the literature.

    Frequently asked questions

    What makes a neural network “deep”?

    Depth refers to the number of layers. A deep network has many hidden layers, which lets it learn features in stages, from simple patterns in early layers to abstract concepts in later ones.

    What is backpropagation?

    Backpropagation is the algorithm that computes the gradient of the loss with respect to each weight by applying the chain rule backward through the network. These gradients tell the optimiser how to adjust the weights.

    What is the role of an activation function?

    An activation function adds non-linearity to each neuron. Without it, stacking layers would be equivalent to a single linear transformation, and the network could not model complex relationships.

    How does gradient descent train a network?

    Gradient descent repeatedly adjusts the weights by a small step in the direction that reduces the loss, using the gradients from backpropagation and a learning rate to control the step size.