Tag: validity

  • Reliability and Validity in Psychological Measurement

    Reliability is the consistency of a measurement, while validity is whether the measurement captures what it is intended to capture. Together they are the two pillars of psychometrics. A psychological test is only as trustworthy as these properties allow, and reporting them is a basic expectation of credible, reproducible research.

    The three faces of reliability

    Reliability concerns whether a measure gives consistent results. It comes in three main forms, each addressing a different place where inconsistency can arise: across time, across raters, or across items.

    • Test-retest reliability: do the same people get similar scores when measured again after a delay? High test-retest reliability suggests the instrument captures a stable attribute rather than transient noise.
    • Inter-rater reliability: when human raters score the same behaviour, do they agree? Strong inter-rater reliability shows that the result reflects the thing observed, not the observer.
    • Internal consistency: do items on a scale that are meant to measure one construct correlate with each other? This is commonly summarised by Cronbach’s alpha, which indexes how well a set of items hangs together.

    The three faces of validity

    Validity concerns meaning—whether the score corresponds to the intended construct. The main types are:

    • Construct validity: does the test actually measure the abstract concept it targets, such as anxiety or numerical ability? Evidence accumulates from how scores relate to other measures as theory predicts.
    • Content validity: do the items adequately sample the full domain? A maths test that only covered addition would have poor content validity for general numeracy.
    • Criterion validity: does the score predict or correspond to an external benchmark, such as later performance or an established gold-standard measure?

    Reliability and validity at a glance

    Property    | Type                                    | Key question
    ------------|-----------------------------------------|--------------------------------------
    Reliability | Test-retest                             | Are scores stable over time?
    Reliability | Inter-rater                             | Do different raters agree?
    Reliability | Internal consistency (Cronbach’s alpha) | Do items measure one thing together?
    Validity    | Construct                               | Does it measure the intended concept?
    Validity    | Content                                 | Do items cover the whole domain?
    Validity    | Criterion                               | Does it predict a relevant outcome?

    Why a measure can be reliable but not valid

    This is the most important conceptual point in psychometrics, and it is worth stating carefully. Reliability is necessary but not sufficient for validity. A bathroom scale that always reads three kilograms heavy is perfectly reliable—it gives the same answer every time—yet it is not a valid measure of weight, because it is consistently wrong. Likewise, a personality questionnaire can produce stable scores that nonetheless do not correspond to the trait it claims to assess. A measure cannot be valid without being reliable, but it can be reliable without being valid. Validity is therefore the higher bar. The practical implication is that demonstrating consistency is only the first step; an instrument must additionally be shown to track the construct it names before its scores can support any substantive claim.
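
    The bathroom-scale example can be put into numbers. The sketch below is illustrative only: the true weight of 70 kg and the five readings are made-up values, not data from any study. It separates consistency (the spread of repeated readings) from accuracy (the systematic bias).

```python
from statistics import mean, pstdev

TRUE_WEIGHT_KG = 70.0  # assumed true weight, for illustration only

# Hypothetical readings from a scale that always reads 3 kg heavy.
readings = [73.0, 73.0, 73.0, 73.0, 73.0]

spread = pstdev(readings)               # 0.0 means perfectly repeatable (reliable)
bias = mean(readings) - TRUE_WEIGHT_KG  # systematic error in kg (a validity problem)

print(f"spread = {spread:.1f} kg, bias = {bias:+.1f} kg")
# prints: spread = 0.0 kg, bias = +3.0 kg
```

    A spread of zero is what reliability looks like; the non-zero bias is what a reliability coefficient cannot reveal.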

    How reliability is estimated in practice

    Each form of reliability has a characteristic study design. Test-retest reliability is estimated by administering the same measure to the same people twice and correlating the two sets of scores; the delay must be long enough that memory of the first sitting does not inflate agreement, but short enough that the trait itself has not genuinely changed. Inter-rater reliability is assessed by having two or more trained raters score the same material independently and computing their agreement, often with a coefficient that corrects for chance. Internal consistency is calculated from a single administration by examining how the items intercorrelate, with Cronbach’s alpha the most familiar summary. Reporting which coefficient was used, and its value, lets readers judge whether a measure is fit for purpose.

    A note on Cronbach’s alpha

    Alpha is ubiquitous but frequently misread. A high value does not by itself prove a scale measures a single construct; it is sensitive to the number of items, so long scales can post a respectable alpha even when their items are only loosely related. Conversely, a very high alpha may signal redundant, near-duplicate items rather than a well-rounded measure. Alpha is therefore best treated as one piece of evidence about internal structure, interpreted alongside the scale’s design and its factor structure, not as a single pass-or-fail threshold.
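
    The dependence on scale length is easy to see with the standardised form of alpha, which expresses the coefficient in terms of the number of items k and the average inter-item correlation. The sketch below assumes that form (it treats all items as having equal variance) and uses an arbitrary, deliberately weak average correlation of 0.2; the function name is illustrative.

```python
def standardized_alpha(k, mean_r):
    """Standardised alpha from item count k and mean inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Same weak average correlation (0.2), increasing number of items.
for k in (5, 10, 20, 40):
    print(k, round(standardized_alpha(k, 0.2), 2))
# prints:
# 5 0.56
# 10 0.71
# 20 0.83
# 40 0.91
```

    The items are no more closely related in the 40-item case than in the 5-item case, yet alpha rises from 0.56 to 0.91 purely because the scale is longer.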

    Validity is an accumulating argument

    Modern psychometrics treats validity less as a fixed property a test “has” and more as an evidence-based argument that builds over time. Construct, content and criterion evidence each contribute, and a measure earns confidence as independent studies show its scores behaving as theory predicts—correlating with related measures, diverging from unrelated ones and predicting relevant outcomes. This framing explains why a brand-new instrument cannot simply be declared valid; validity is demonstrated through replication, which ties measurement quality directly to the field’s reproducibility agenda.

    Implications for research and assessment

    These properties are not academic niceties; they determine whether a finding will replicate. Instruments with poor reliability add noise that can mask real effects or generate spurious ones, a concern at the heart of the field’s work on reproducibility. Many critiques of popular tools reduce to measurement questions: for example, the objections to the Myers-Briggs Type Indicator concern its reliability and construct validity. Responsible assessment requires that both properties be measured and disclosed.

    Reliability, error and the individual score

    Reliability has a direct, practical meaning for how much trust to place in a single person’s score. Every observed score can be thought of as a true score plus measurement error, and the lower the reliability, the larger that error band. The standard error of measurement translates a reliability coefficient into a margin of uncertainty around an individual’s result, which is why responsible test reports present scores as ranges rather than precise points. Ignoring this band is a common misuse: treating a one-point difference between two people as meaningful when it falls well within measurement error. For consequential decisions, the size of the error band can matter as much as the score itself, and it should be reported alongside the headline number.
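
    The standard error of measurement is conventionally computed as the score standard deviation multiplied by the square root of one minus the reliability. The sketch below applies that formula; the scale (standard deviation 15), the reliability of 0.91 and the observed score of 100 are assumed values chosen for illustration, and centring the band on the observed score is a common simplification.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def score_band(observed, sd, reliability, z=1.96):
    """Approximate 95% band around an observed score (z = 1.96)."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem


sem = standard_error_of_measurement(sd=15, reliability=0.91)
low, high = score_band(observed=100, sd=15, reliability=0.91)

print(f"SEM = {sem:.1f}, 95% band = {low:.2f} to {high:.2f}")
# prints: SEM = 4.5, 95% band = 91.18 to 108.82
```

    Even with a reliability of 0.91, the band spans almost 18 points, which is why a one-point difference between two people carries little information.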

    Reporting psychometrics transparently

    Researchers should report which reliability and validity evidence supports each measure, ideally with the relevant coefficients. Consistent terminology helps: defining terms in a shared research dictionary lets readers compare studies, and clear guidance for authors turns good intentions into routine practice. Transparency about measurement is one of the cheapest ways to improve the reliability of the literature as a whole.

    Frequently asked questions

    What is the difference between reliability and validity?

    Reliability is consistency—getting the same result repeatedly—while validity is accuracy—measuring the intended construct. A test must be reliable to be valid, but reliability alone does not guarantee validity.

    Can a test be reliable but not valid?

    Yes. A scale that consistently reads three kilograms too heavy is reliable but not valid. The result is stable yet systematically wrong, so it does not measure true weight.

    What is Cronbach’s alpha?

    Cronbach’s alpha is a common index of internal consistency. It estimates how well the items on a scale that are meant to measure one construct correlate with one another.

    Why do reliability and validity matter for reproducibility?

    Measures with weak reliability or validity add noise and bias, making findings harder to replicate. Reporting these properties is part of producing reproducible, trustworthy research.

  • DISC and Personality Assessment as Measurement Science

    The DISC test is a self-report behavioural assessment that profiles people across four dimensions: Dominance, Influence, Steadiness and Conscientiousness. Widely used in workplace training and team-building, DISC traces to the work of psychologist William Moulton Marston in the 1920s. Like any personality instrument, its usefulness depends not on popularity but on how well it satisfies measurement-science criteria—reliability, validity and fitness for the purpose to which it is applied.

    Origins and the four dimensions

    Marston proposed a model of normal human emotions and behaviour built around two axes—how a person perceives their environment (favourable or antagonistic) and how active or passive they feel in it. Later authors operationalised these ideas into questionnaires that yield the familiar DISC profile.

    Dimension             | Behavioural emphasis
    ----------------------|-----------------------------------------
    Dominance (D)         | Directness, control, results focus
    Influence (I)         | Sociability, persuasion, enthusiasm
    Steadiness (S)        | Patience, cooperation, stability
    Conscientiousness (C) | Accuracy, structure, attention to detail

    Marston was a theorist of emotion, not of psychometric test construction; he did not design DISC as a rigorously validated assessment, and modern commercial versions vary in quality. It is also worth noting that the four labels describe behavioural styles—how a person tends to act in a given context—rather than fixed, deep-seated traits. Behaviour is partly situational, so a profile captured at one moment in one setting may not generalise to another, a point that should temper any strong claims drawn from a single administration.

    How such instruments are evaluated

    Any personality measure should be judged on the standard psychometric criteria. Reliability covers consistency: test-retest stability, internal consistency (often summarised by Cronbach’s alpha) and, where raters are involved, inter-rater agreement. Validity covers meaning: construct validity (does the test measure the trait it claims?), content validity (do the items sample the domain?) and criterion validity (does the score predict relevant outcomes?). A reputable instrument publishes these properties; a marketing brochure that omits them is a warning sign.

    Norms, fairness and interpretation

    A frequently overlooked requirement of any assessment used with people is a defensible set of norms—the reference data against which an individual’s score is interpreted. A raw DISC profile means little without knowing the population it is compared against; a score that looks “high” relative to one norm group may be average against another. Responsible use therefore depends on the publisher documenting who the norm sample was, how large it was and when it was collected, and on practitioners checking that those norms are appropriate for the people being assessed. Where norms are outdated, unrepresentative or undisclosed, interpretations risk being unfair, particularly if results feed into decisions about individuals. These fairness considerations are part of why such tools are better confined to development than to selection.
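
    The effect of the norm group can be shown with a standard score. In the sketch below, the raw score of 60 and both sets of norm statistics are invented for illustration, and the percentile calculation assumes scores in each norm group are roughly normally distributed.

```python
from statistics import NormalDist


def z_score(raw, norm_mean, norm_sd):
    """How many norm-group standard deviations the raw score sits from the mean."""
    return (raw - norm_mean) / norm_sd


raw = 60  # hypothetical raw score on one dimension

# Two hypothetical norm groups with different means.
z_a = z_score(raw, norm_mean=50, norm_sd=10)
z_b = z_score(raw, norm_mean=60, norm_sd=10)

pct_a = NormalDist().cdf(z_a) * 100
pct_b = NormalDist().cdf(z_b) * 100

print(f"norm A: z = {z_a:.1f}, percentile = {pct_a:.0f}")
print(f"norm B: z = {z_b:.1f}, percentile = {pct_b:.0f}")
# prints:
# norm A: z = 1.0, percentile = 84
# norm B: z = 0.0, percentile = 50
```

    The same raw score is "high" against one group and exactly average against the other, which is why an undisclosed norm sample makes a profile uninterpretable.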

    Strengths and limitations of DISC

    DISC’s appeal is its simplicity and a shared vocabulary for discussing communication styles. Used as a facilitation aid, it can prompt useful reflection and dialogue. Its limitations are also clear. The model focuses on observable behavioural style rather than the broad trait structure recovered in academic research, and independent peer-reviewed validation is thinner than for established inventories. Where rigorous prediction is required, researchers more often turn to dimensional models such as the Big Five, which have stronger published reliability and validity—a contrast also seen in critiques of the Myers-Briggs Type Indicator.

    Appropriate versus inappropriate uses

    The measurement-science verdict on DISC is best stated as a matter of fit:

    • Appropriate: stimulating self-awareness, opening conversations about working styles, and structuring team-building discussions where no high-stakes decision rides on the result.
    • Inappropriate: selecting or rejecting job candidates, denying promotions, or making any consequential judgement about a person, especially when the specific version’s predictive validity is undocumented.

    This distinction is central to responsible assessment: a tool can be valuable for development and yet wholly unsuitable for selection. Treating a developmental instrument as a gatekeeping test imports risks of unfairness and poor decisions.

    From Marston’s theory to commercial instruments

    It is worth separating the model from the products built on it. Marston set out his ideas in his 1928 book on the emotions of normal people, describing behavioural tendencies along his two axes. He did not, however, publish a validated assessment. The questionnaires sold today were developed later by various authors and publishers, who differ in their item construction, scoring and norming. As a result, “DISC” is not a single standardised test but a family of instruments of varying quality, and a positive evaluation of one product does not transfer to another that merely shares the name. A measurement-science evaluation must therefore target the specific version in use, asking for its technical manual, sample sizes, reliability coefficients and validity studies rather than accepting the brand at face value.

    Ipsative scoring and its consequences

    Many DISC instruments use a forced-choice, or ipsative, format in which respondents rank options against one another rather than rating each independently. Ipsative scoring has a known measurement drawback: because raising one score necessarily lowers others, the dimensions are not independent, which complicates comparisons between people and can distort the apparent profile. This is a technical reason that some DISC products are better suited to within-person reflection (“which of my tendencies is strongest?”) than to between-person ranking (“is candidate A more dominant than candidate B?”). Recognising the scoring model is part of judging whether a tool fits the intended use.
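
    The dependence between ipsative scores can be seen in a toy scoring routine. The format below is hypothetical (eight forced-choice blocks, one "most like me" pick per block, one point per pick) and is not the scoring scheme of any particular DISC product; the respondents and their choices are made up.

```python
from collections import Counter

DIMENSIONS = ("D", "I", "S", "C")


def ipsative_profile(choices):
    """Count how often each dimension was picked as 'most like me'."""
    counts = Counter(choices)
    return {d: counts[d] for d in DIMENSIONS}


# Hypothetical respondents, eight forced-choice blocks each.
alex = ipsative_profile(["D", "D", "I", "S", "D", "C", "I", "D"])
sam = ipsative_profile(["S", "S", "C", "S", "I", "S", "C", "D"])

print(alex, sum(alex.values()))
print(sam, sum(sam.values()))
# prints:
# {'D': 4, 'I': 2, 'S': 1, 'C': 1} 8
# {'D': 1, 'I': 1, 'S': 4, 'C': 2} 8
```

    Every profile sums to the same total, so a point gained on one dimension is a point lost elsewhere. The scores show how each person's tendencies rank against one another, but Alex's 4 on D does not establish that Alex is more dominant than Sam in any absolute sense.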

    A checklist for evaluating any personality tool

    The questions that should be asked of DISC apply to every commercial assessment:

    • Is there an independent, peer-reviewed evidence base, or only publisher materials?
    • Are reliability and validity coefficients published and adequate?
    • Is the scoring normative or ipsative, and does that suit the purpose?
    • Is the instrument being used for development or for a high-stakes decision?

    Applying this checklist consistently is what separates responsible assessment from the uncritical adoption of whichever tool is most heavily marketed.

    Reporting and transparency

    When personality data informs research or organisational practice, the instrument, its version and its psychometric evidence should be reported plainly so others can judge the result. This transparency mirrors the wider push for clear, reusable terminology recorded in a controlled research dictionary, and it gives authors a defensible basis for the claims they make. Without it, scores carry an unearned air of precision.

    Frequently asked questions

    What does DISC stand for?

    DISC stands for Dominance, Influence, Steadiness and Conscientiousness—four behavioural dimensions used to describe a person’s typical working and communication style.

    Who created the DISC model?

    The underlying theory comes from psychologist William Moulton Marston in the 1920s. Later authors and commercial publishers turned his ideas into the questionnaires now marketed as DISC assessments.

    Is DISC scientifically valid?

    It depends on the specific version. DISC can be useful as a development and communication aid, but independent validity evidence varies between products, so it should not be used for hiring or other high-stakes decisions.

    How should organisations use DISC responsibly?

    Limit it to self-awareness and team discussion, avoid using it to screen candidates, and report the instrument and its psychometric properties so results can be judged on their evidence.

  • The MBTI: A Measurement-Science Critique of the Myers-Briggs Type Indicator

    The Myers-Briggs Type Indicator (MBTI) is a self-report personality questionnaire that classifies respondents into one of 16 “types” using four dichotomies. Developed by Katharine Cook Briggs and Isabel Briggs Myers from Carl Jung’s theory of psychological types, it remains popular in workplaces and coaching. From a measurement-science perspective, however, the instrument has well-documented weaknesses in reliability and validity that explain why academic personality psychology rarely uses it.

    The four dichotomies and 16 types

    The MBTI scores respondents on four opposing pairs and combines the results into a four-letter code:

    Dichotomy           | Poles                                | Question it addresses
    --------------------|--------------------------------------|----------------------------------
    Attitude            | Extraversion (E) – Introversion (I)  | Where attention is directed
    Perceiving function | Sensing (S) – Intuition (N)          | How information is taken in
    Judging function    | Thinking (T) – Feeling (F)           | How decisions are made
    Orientation         | Judging (J) – Perceiving (P)         | Preferred way of engaging the world

    The four binary outcomes multiply to 16 type codes such as INTJ or ESFP. Each is presented as a qualitatively distinct category rather than a position on a scale.

    The dichotomisation problem

    The central measurement objection is that the MBTI treats continuous traits as categories. Empirical trait distributions are typically unimodal and roughly bell-shaped, not bimodal: most people cluster near the middle rather than at one pole. Imposing a cut-point splits a continuum into two boxes and discards information. Someone scoring just over the boundary is grouped with people far more extreme, while two near-identical respondents either side of the line receive different letters. This is why a small shift on retest can flip a whole type.
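
    A toy example makes the information loss concrete. The 0 to 100 scale, the cut-point of 50 and the three respondents below are all assumptions for illustration, not the MBTI’s actual scoring.

```python
def letter(score, cut=50):
    """Collapse a continuous score into one of two letters at a cut-point."""
    return "E" if score >= cut else "I"


# Hypothetical continuous scores on an extraversion-like scale.
scores = {"Ana": 51, "Ben": 49, "Cy": 95}

for name, score in scores.items():
    print(name, score, letter(score))
# prints:
# Ana 51 E
# Ben 49 I
# Cy 95 E
```

    Ana and Ben differ by two points yet receive opposite letters, while Ana and Cy differ by 44 points yet receive the same one. A two-point drop on retest would be enough to move Ana from E to I.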

    Reliability concerns

    Reliability is the consistency of a measure. Test-retest reliability asks whether the same person obtains the same result on a later occasion. Studies have reported that a substantial proportion of respondents receive a different four-letter type when retested weeks later. Because the type is the headline output, even modest instability at each dichotomy compounds across four binary decisions, undermining the categorical claim that people “are” a fixed type.
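
    The compounding effect is simple arithmetic. If the four dichotomies are treated as independent (a simplifying assumption), the chance of receiving the same four-letter type twice is the product of the four per-dichotomy agreement rates. The rates used below are illustrative values, not measured MBTI figures.

```python
import math


def same_type_probability(per_dichotomy_agreement):
    """Probability that all four letters match on retest, assuming independence."""
    return math.prod(per_dichotomy_agreement)


# Illustrative agreement rates for each of the four dichotomies.
for rate in (0.95, 0.90, 0.85):
    p = same_type_probability([rate] * 4)
    print(f"per-dichotomy agreement {rate:.2f} -> same type {p:.2f}")
# prints:
# per-dichotomy agreement 0.95 -> same type 0.81
# per-dichotomy agreement 0.90 -> same type 0.66
# per-dichotomy agreement 0.85 -> same type 0.52
```

    Under these assumptions, a dichotomy that gives the same letter nine times out of ten still leaves roughly a third of respondents with a different overall type.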

    Validity concerns

    Validity asks whether an instrument measures what it claims and predicts what it should. The MBTI’s construct validity is questioned because its Thinking–Feeling and Judging–Perceiving axes do not map cleanly onto the trait structure repeatedly recovered in factor-analytic research. Criterion validity is also limited: type codes are weak predictors of job performance, and the instrument was not designed to rank or select candidates. Using it for hiring or promotion is an inappropriate application that conflicts with responsible-assessment principles.

    Why personality psychology prefers the Big Five

    The dominant model in academic personality research is the Big Five, or Five-Factor Model: Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism. Unlike the MBTI, it is dimensional rather than typological, so each person receives a continuous score on every factor. The five factors emerged from decades of factor analysis across languages and cultures, show stronger reliability and better criterion validity, and avoid the artefacts introduced by dichotomising. The MBTI’s Extraversion–Introversion axis broadly aligns with the Big Five’s Extraversion dimension, but the framework as a whole captures gradation that a 16-box scheme cannot. A further contrast is that the Big Five includes Neuroticism—a well-replicated dimension of emotional stability with substantial predictive value—which the MBTI omits entirely, leaving a meaningful part of personality unmeasured.

    The Jungian foundations and where the model departs

    The MBTI’s intellectual lineage runs back to Carl Jung’s 1921 work on psychological types, which proposed attitudes (introversion and extraversion) and functions (sensing, intuition, thinking, feeling). Briggs and Myers, who were not academic psychologists, formalised these ideas into a scored questionnaire and added the Judging–Perceiving axis to identify which function a person leads with. The difficulty is that Jung’s typology was a clinical and theoretical scheme, never validated as a measurement instrument. Building a forced-choice questionnaire on top of it inherited the typological assumption—that people fall into discrete kinds—without testing whether the data support discreteness. Modern psychometric research generally finds they do not: trait scores vary smoothly, so the categories are imposed rather than discovered.

    What the evidence base actually looks like

    Much of the supportive literature for the MBTI has appeared in outlets associated with the instrument’s publishers rather than in independent, peer-reviewed personality journals. Independent reviews have repeatedly raised the same points: limited test-retest stability for the overall type, factor structures that do not cleanly reproduce the four advertised dimensions as fully independent, and weak incremental prediction of real-world outcomes once general traits are accounted for. By contrast, the Big Five literature spans thousands of independent studies, multiple languages and decades of replication. This asymmetry in the evidence base is itself a measurement-science signal: an instrument with strong properties tends to accumulate convergent, independent support.

    How to read a type result responsibly

    If an organisation already uses the MBTI, the responsible stance is to treat the four-letter code as a conversation starter, not a verdict. A type should never be recorded on a personnel file, used to allocate roles, or invoked to explain away a colleague’s behaviour. Because the result can change between sittings, any decision that would differ depending on which side of a cut-point someone landed is, by construction, unsafe. Where genuine measurement is needed—research, selection, or development tracking—a dimensional inventory with published reliability and validity is the defensible choice. Documenting which instrument was used and why, much as researchers record terms in a controlled research dictionary, lets others judge the evidence behind a claim.

    A balanced reading

    None of this makes the MBTI useless as a conversational vocabulary or a self-reflection prompt; many people find the language engaging. The measurement-science point is narrower and evidence-based: a tool valued for facilitation should not be repurposed as a precise, predictive instrument for high-stakes decisions. Practitioners who need defensible measurement should consult validated dimensional inventories and document their psychometric properties. The wider lesson connects to reproducibility reform: popularity is not evidence, and instruments deserve the same scrutiny as the findings they generate.

    Frequently asked questions

    Is the MBTI scientifically valid?

    The MBTI has well-documented limitations in reliability and validity. Critics highlight unstable retest results and weak prediction of outcomes such as job performance, which is why it is uncommon in peer-reviewed personality research.

    Why do MBTI results sometimes change between tests?

    Because the instrument places hard cut-points on continuous traits, people who score near a boundary can flip to the opposite letter after only a small change in their answers. A flip on any one of the four dichotomies is enough to produce a different overall type.

    What is the difference between the MBTI and the Big Five?

    The MBTI sorts people into 16 categorical types, whereas the Big Five gives continuous scores on five dimensions. The Big Five generally shows stronger reliability and validity and is the standard in academic work. Authors reporting personality measures should describe the model and its psychometrics.

    Should the MBTI be used for hiring?

    No. The instrument was not designed for selection and its criterion validity for job performance is weak. Using categorical type codes to screen candidates conflicts with responsible-assessment practice.