Reliability and Validity in Psychological Measurement

Reliability is the consistency of a measurement, while validity is whether the measurement captures what it is intended to capture. Together they are the two pillars of psychometrics. A psychological test is only as trustworthy as these properties allow, and reporting them is a basic expectation of credible, reproducible research.

The three faces of reliability

Reliability concerns whether a measure gives consistent results. It comes in several forms, each addressing a different source of inconsistency (time, raters or items):

Test-retest reliability: do the same people get similar scores when measured again after a delay? High test-retest reliability suggests the instrument captures a stable attribute rather than transient noise.
Inter-rater reliability: when human raters score the same behaviour, do they agree? Strong inter-rater reliability suggests that the result reflects the thing observed rather than the individual observer.
Internal consistency: do items on a scale that are meant to measure one construct correlate with each other? This is commonly summarised by Cronbach’s alpha, which indexes how well a set of items hangs together.

The three faces of validity

Validity concerns meaning—whether the score corresponds to the intended construct. The main types are:

Construct validity: does the test actually measure the abstract concept it targets, such as anxiety or numerical ability? Evidence accumulates from how scores relate to other measures as theory predicts.
Content validity: do the items adequately sample the full domain? A maths test that only covered addition would have poor content validity for general numeracy.
Criterion validity: does the score predict or correspond to an external benchmark, such as later performance or an established gold-standard measure?

Reliability and validity at a glance

Property	Type	Key question
Reliability	Test-retest	Are scores stable over time?
Reliability	Inter-rater	Do different raters agree?
Reliability	Internal consistency (Cronbach’s alpha)	Do items measure one thing together?
Validity	Construct	Does it measure the intended concept?
Validity	Content	Do items cover the whole domain?
Validity	Criterion	Does it predict a relevant outcome?

Why a measure can be reliable but not valid

This is the most important conceptual point in psychometrics, and it is worth stating carefully. Reliability is necessary but not sufficient for validity. A bathroom scale that always reads three kilograms heavy is perfectly reliable—it gives the same answer every time—yet it is not a valid measure of weight, because it is consistently wrong. Likewise, a personality questionnaire can produce stable scores that nonetheless do not correspond to the trait it claims to assess. A measure cannot be valid without being reliable, but it can be reliable without being valid. Validity is therefore the higher bar. The practical implication is that demonstrating consistency is only the first step; an instrument must additionally be shown to track the construct it names before its scores can support any substantive claim.
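The bathroom-scale example can be sketched in a few lines. This is only an illustration: the true weight of 70 kg and the function name are hypothetical, and the scale is modelled as having no random error at all.

```python
import statistics

TRUE_WEIGHT_KG = 70.0  # hypothetical true weight, chosen for illustration
BIAS_KG = 3.0          # the scale always reads three kilograms heavy


def biased_scale(true_weight: float) -> float:
    """A perfectly consistent but systematically wrong instrument."""
    return true_weight + BIAS_KG


# Weigh the same person five times.
readings = [biased_scale(TRUE_WEIGHT_KG) for _ in range(5)]

# Reliability: how much do repeated readings vary?
spread = statistics.pstdev(readings)

# Validity: how far is the average reading from the truth?
mean_error = statistics.fmean(readings) - TRUE_WEIGHT_KG

print(f"spread across repeats: {spread:.1f} kg")
print(f"average error: {mean_error:+.1f} kg")
```

The spread is zero while the average error stays at three kilograms, which is the sense in which consistency alone says nothing about accuracy.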

How reliability is estimated in practice

Each form of reliability has a characteristic study design. Test-retest reliability is estimated by administering the same measure to the same people twice and correlating the two sets of scores; the delay must be long enough that memory of the first sitting does not inflate agreement, but short enough that the trait itself has not genuinely changed. Inter-rater reliability is assessed by having two or more trained raters score the same material independently and computing their agreement, often with a chance-corrected coefficient such as Cohen’s kappa. Internal consistency is calculated from a single administration by examining how the items intercorrelate, with Cronbach’s alpha the most familiar summary. Reporting which coefficient was used, and its value, lets readers judge whether a measure is fit for purpose.
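The three designs above map onto three simple calculations. The sketch below is a minimal illustration using only the Python standard library; the function names and the small data sets are invented for the example, Pearson’s correlation is assumed for test-retest, and Cohen’s kappa is assumed as the chance-corrected agreement coefficient for two raters.

```python
from collections import Counter
from math import sqrt
from statistics import fmean, pvariance


def pearson_r(x, y):
    """Test-retest: correlate scores from two sittings."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def cohens_kappa(rater_a, rater_b):
    """Inter-rater: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)


def cronbach_alpha(rows):
    """Internal consistency: rows are respondents, columns are items."""
    k = len(rows[0])
    item_vars = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data, for illustration only.
time1 = [10, 12, 14, 16, 18]
time2 = [11, 12, 15, 15, 19]
rater_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
rater_b = ["yes", "yes", "no", "yes", "yes", "no", "no", "no"]
item_scores = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1], [3, 3, 3]]

retest_r = pearson_r(time1, time2)
kappa = cohens_kappa(rater_a, rater_b)  # 0.5: raw agreement 0.75, chance 0.5
alpha = cronbach_alpha(item_scores)

print(f"test-retest r = {retest_r:.2f}")
print(f"Cohen's kappa = {kappa:.2f}")
print(f"Cronbach's alpha = {alpha:.2f}")
```

In the rater example the two raters agree on six of eight cases, yet half of that agreement would be expected by chance, which is why kappa is noticeably lower than the raw agreement rate.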

A note on Cronbach’s alpha

Alpha is ubiquitous but frequently misread. A high value does not by itself prove a scale measures a single construct; it is sensitive to the number of items, so long scales can post a respectable alpha even when their items are only loosely related. Conversely, a very high alpha may signal redundant, near-duplicate items rather than a well-rounded measure. Alpha is therefore best treated as one piece of evidence about internal structure, interpreted alongside the scale’s design and its factor structure, not as a single pass-or-fail threshold.
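The dependence on scale length is easy to see with the standardised form of alpha, which depends only on the number of items and their average inter-item correlation. This is a sketch under the assumption that items are standardised (equal variances); the function name and the correlation of 0.10 are chosen for illustration.

```python
def standardized_alpha(n_items: int, mean_r: float) -> float:
    """Standardised alpha from item count and mean inter-item correlation."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


# Hold the average inter-item correlation at a weak 0.10 and lengthen the scale.
WEAK_R = 0.10
alphas = {k: standardized_alpha(k, WEAK_R) for k in (5, 10, 20, 40)}

for k, value in alphas.items():
    print(f"{k:>2} items: alpha = {value:.2f}")
```

With items that correlate only 0.10 on average, alpha rises from about 0.36 at five items to about 0.82 at forty, so a respectable-looking coefficient can come from length alone.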

Validity is an accumulating argument

Modern psychometrics treats validity less as a fixed property a test “has” and more as an evidence-based argument that builds over time. Construct, content and criterion evidence each contribute, and a measure earns confidence as independent studies show its scores behaving as theory predicts—correlating with related measures, diverging from unrelated ones and predicting relevant outcomes. This framing explains why a brand-new instrument cannot simply be declared valid; validity is demonstrated through replication, which ties measurement quality directly to the field’s reproducibility agenda.

Implications for research and assessment

These properties are not academic niceties; they determine whether a finding will replicate. Instruments with poor reliability add noise that can mask real effects or generate spurious ones, a concern at the heart of the field’s work on reproducibility. Many critiques of popular tools reduce to measurement questions: the objections to the Myers-Briggs Type Indicator, for example, concern its reliability and construct validity. Responsible assessment requires that both properties be measured and disclosed.
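The masking effect of unreliability can be made concrete with the attenuation formula from classical test theory: the correlation one expects to observe equals the true correlation multiplied by the square root of the product of the two measures’ reliabilities. The sketch below assumes that model holds; the function name and the numbers are hypothetical.

```python
from math import sqrt


def attenuated_r(true_r: float, rel_x: float, rel_y: float) -> float:
    """Expected observed correlation given each measure's reliability."""
    return true_r * sqrt(rel_x * rel_y)


# A true correlation of 0.50 measured with reliabilities of 0.64 and 0.81.
observed = attenuated_r(0.50, rel_x=0.64, rel_y=0.81)
print(f"expected observed r = {observed:.2f}")
```

Here a true correlation of 0.50 is expected to appear as roughly 0.36, so a study powered to detect 0.50 may miss the effect entirely. The formula covers only the masking side of the problem; it does not model how noise produces spurious findings.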

Reliability, error and the individual score

Reliability has a direct, practical meaning for how much trust to place in a single person’s score. Every observed score can be thought of as a true score plus measurement error, and the lower the reliability, the larger that error band. The standard error of measurement translates a reliability coefficient into a margin of uncertainty around an individual’s result, which is why responsible test reports present scores as ranges rather than precise points. Ignoring this band is a common misuse: treating a one-point difference between two people as meaningful when it falls well within measurement error. For consequential decisions, the size of the error band can matter as much as the score itself, and it should be reported alongside the headline number.
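The conversion from a reliability coefficient to an error band follows the usual classical test theory formula, in which the standard error of measurement is the score standard deviation multiplied by the square root of one minus the reliability. A minimal sketch, assuming normally distributed error and using hypothetical values (a scale with standard deviation 15 and reliability 0.91):

```python
from math import sqrt


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)


def score_band(observed: float, sd: float, reliability: float, z: float = 1.96):
    """Approximate 95% band around an observed score (z = 1.96)."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem


sem = standard_error_of_measurement(sd=15, reliability=0.91)
low, high = score_band(observed=100, sd=15, reliability=0.91)

print(f"SEM = {sem:.2f}")
print(f"95% band: {low:.1f} to {high:.1f}")
```

Even with reliability as high as 0.91, the standard error is about 4.5 points and the band around a score of 100 runs from roughly 91 to 109, which is why a one-point difference between two people carries little information.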

Reporting psychometrics transparently

Researchers should report which reliability and validity evidence supports each measure, ideally with the relevant coefficients. Consistent terminology helps: defining terms in a shared research dictionary lets readers compare studies, and clear guidance for authors turns good intentions into routine practice. Transparency about measurement is one of the cheapest ways to improve the reliability of the literature as a whole.

Frequently asked questions

What is the difference between reliability and validity?

Reliability is consistency—getting the same result repeatedly—while validity is accuracy—measuring the intended construct. A test must be reliable to be valid, but reliability alone does not guarantee validity.

Can a test be reliable but not valid?

Yes. A scale that consistently reads three kilograms too heavy is reliable but not valid. The result is stable yet systematically wrong, so it does not measure true weight.

What is Cronbach’s alpha?

Cronbach’s alpha is a common index of internal consistency. It estimates how well the items on a scale that are meant to measure one construct correlate with one another.

Why do reliability and validity matter for reproducibility?

Measures with weak reliability or validity add noise and bias, making findings harder to replicate. Reporting these properties is part of producing reproducible, trustworthy research.

June 20, 2026

What Is Psychology? Scope, Methods and the Scientific Discipline

Psychology is the scientific study of mind and behaviour, using systematic observation, measurement and experiment to build and test theories. As an empirical discipline it spans the biological, cognitive, developmental, social and individual aspects of how people and animals perceive, think, feel and act. The American Psychological Association (APA) frames it as a science grounded in evidence rather than intuition or anecdote.

The scope of the discipline

Psychology sits at the intersection of the natural and social sciences. It draws on biology and neuroscience to understand the brain, on statistics to quantify behaviour, and on social science to study groups and culture. Its defining commitment is methodological: claims about the mind are evaluated against data gathered under controlled, reproducible conditions rather than accepted on authority. That commitment distinguishes scientific psychology from folk or popular psychology, which may offer intuitively appealing explanations that have never been tested. The discipline’s value lies in its willingness to discard attractive ideas when evidence contradicts them, and to quantify uncertainty rather than asserting confident conclusions about complex human behaviour.

Major subfields

Subfield	Central question
Cognitive psychology	How do attention, memory, language and reasoning work?
Developmental psychology	How do mind and behaviour change across the lifespan?
Social psychology	How do others influence thought, feeling and action?
Biological psychology	How do brain and body underpin behaviour?
Personality & individual differences	How and why do people differ in stable ways?
Clinical & counselling	How are psychological difficulties understood and supported?

Research methods

Psychology relies on a toolkit of complementary methods. Experiments manipulate one variable while holding others constant to test cause and effect, ideally with random assignment to conditions. Observational and correlational studies measure variables as they naturally occur, describing associations without claiming causation. Psychometrics is the science of building and evaluating measures—questionnaires, ability tests and rating scales—so that scores are consistent and meaningful. Underpinning all of these is careful attention to reliability and validity, the twin pillars of sound measurement.

Quantitative and qualitative approaches

Psychological research is often divided into quantitative and qualitative traditions, and mature programmes frequently combine them. Quantitative work expresses phenomena as numbers and analyses them statistically, prioritising measurement, comparison and generalisation across large samples. Qualitative work—interviews, focus groups, thematic analysis of text—seeks rich, contextual understanding of how people make meaning, and is well suited to generating hypotheses or studying experiences that resist tidy quantification. Neither is inherently superior; the appropriate method depends on the question. A study estimating how common an attitude is needs quantitative survey methods, whereas one exploring why people hold that attitude may begin qualitatively. Mixed-methods designs deliberately pair the two so that numerical breadth and interpretive depth inform each other.

The scientific method in psychology

Psychological research follows the general cycle of the scientific method: observe a phenomenon, derive a testable hypothesis, design a study, collect and analyse data, and revise theory in light of results. Because human behaviour is variable, psychologists lean heavily on statistics to separate genuine effects from chance. The discipline has also become more reflective about its own methods following the replication crisis, adopting practices such as preregistration and data sharing to strengthen the reliability of published findings.

Measurement and assessment

Much of psychology depends on turning abstract constructs—intelligence, anxiety, conscientiousness—into numbers. This is harder than it looks, and the field has a long tradition of scrutinising its instruments. Popular tools are not automatically trustworthy: assessments such as the Myers-Briggs Type Indicator illustrate how an instrument can be widely used yet fall short on psychometric grounds. Responsible practice means reporting how a measure was validated, a discipline reflected in CASRAI’s work on responsible assessment.

A short history of the discipline

Psychology emerged as a distinct experimental science in the late nineteenth century, conventionally dated to Wilhelm Wundt’s establishment of a dedicated laboratory in Leipzig in 1879. Early schools—structuralism, functionalism and later behaviourism—debated whether psychology should study inner experience or only observable behaviour. The mid-twentieth-century cognitive revolution restored the study of mental processes such as memory and attention using rigorous experimental methods, and the subsequent rise of neuroscience linked those processes to brain function. This trajectory matters because it shows the field repeatedly tightening its methods, a self-correcting tendency that continues in today’s reforms.

Statistics and inference

Because behaviour varies between people and occasions, psychology cannot rely on single observations. It uses inferential statistics to ask whether a pattern in a sample is likely to hold in the wider population. Two ideas are central: effect size, which expresses how large a difference or relationship is, and statistical power, the probability that a study will detect a real effect if one exists. Underpowered studies—those with samples too small to reliably find the effects they seek—produce unstable, often exaggerated results. Understanding these concepts is essential to reading psychological research critically, and their neglect contributed directly to the field’s reproducibility problems.
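Both ideas can be computed directly. The sketch below uses Cohen’s d as the effect size for a two-group comparison and a normal approximation to the power of a two-sided, two-sample test; the approximation ignores the negligible opposite tail and will differ slightly from exact t-based calculations. Function names and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (fmean(group_a) - fmean(group_b)) / sqrt(pooled_var)


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Normal-approximation power for a two-sided, two-sample comparison."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = abs(d) * sqrt(n_per_group / 2)
    return NormalDist().cdf(noncentrality - z_crit)


# Hypothetical scores for two small groups.
d = cohens_d([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])
print(f"Cohen's d = {d:.2f}")

# Power to detect a medium effect (d = 0.5) at two sample sizes.
power_small = approx_power(0.5, n_per_group=20)
power_large = approx_power(0.5, n_per_group=64)
print(f"n = 20 per group: power = {power_small:.2f}")
print(f"n = 64 per group: power = {power_large:.2f}")
```

Under this approximation, twenty participants per group gives only about a one-in-three chance of detecting a medium effect, whereas sixty-four per group gives roughly 80 per cent, the conventional target.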

Distinguishing good evidence from popular myth

A practical skill the discipline cultivates is separating well-supported findings from appealing but shaky claims. Many ideas that circulate as “psychology” in popular media—rigid personality types, single-study effects presented as laws, or memorable graphs taken at face value—rest on weaker foundations than their fame suggests. Sound practice asks how a finding was measured, whether it has replicated, and how large the effect actually is. This is why the field places such weight on reproducibility and on transparent reporting: a claim is only as good as the method behind it.

Ethics in psychological research

Because psychology studies people, it is bound by strong ethical standards. Core principles include informed consent, the right to withdraw, minimisation of harm, confidentiality and, where deception is unavoidable, careful debriefing. Institutional ethics committees, often called institutional review boards, review proposals before data collection begins, and professional bodies such as the APA publish detailed ethics codes. These safeguards became more formalised after historical cases in which participants were exposed to undue stress, and they now shape study design from the outset. Such governance is part of the wider research lifecycle that good metadata and clear terminology, recorded in resources like the research dictionary, are designed to support.

Frequently asked questions

Is psychology a science?

Yes. Psychology uses the scientific method—systematic observation, hypothesis testing, controlled experiments and statistical analysis—to study mind and behaviour, and it revises its theories in light of replicable evidence.

What are the main branches of psychology?

Major subfields include cognitive, developmental, social, biological, personality and clinical psychology. They share common methods but differ in the questions they ask and the populations and processes they study.

What methods do psychologists use?

Psychologists use experiments, observational and correlational studies, and psychometric testing, supported by statistics. Method choice depends on whether the goal is to establish causation, describe associations or measure an attribute reliably.

Why does measurement matter so much in psychology?

Because psychological constructs are abstract, conclusions are only as good as the instruments used. Reliable, valid measures are essential, which is why the field scrutinises its tests and encourages authors to report their measurement properties transparently.

June 18, 2026