  • Regression Analysis: An Introduction for Researchers

    Regression analysis is a statistical method for modelling the relationship between an outcome variable and one or more predictor variables. In its simplest form, linear regression fits a straight line through a scatter of points to describe how the outcome changes, on average, as a predictor changes. It is one of the most widely used tools for prediction and for quantifying associations in research.

    The linear regression equation

    Simple linear regression summarises the relationship between a predictor x and an outcome y with the equation y = a + bx, where a is the intercept and b is the slope. The intercept is the predicted value of y when x is zero, and the slope is the average change in y for a one-unit increase in x. A positive slope indicates that y rises with x; a negative slope indicates that it falls.
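
    As a minimal sketch of the equation above, the snippet below evaluates y = a + bx for a given intercept and slope. The function name `predict` and the coefficient values are hypothetical, chosen only for illustration.

```python
def predict(x, intercept, slope):
    """Return the predicted outcome y = a + b*x for one predictor value."""
    return intercept + slope * x


# Hypothetical coefficients, not estimated from any real data
a, b = 2.0, 0.5

# At x = 0 the prediction equals the intercept
y_at_zero = predict(0, a, b)

# A one-unit increase in x changes the prediction by the slope
change_per_unit = predict(11, a, b) - predict(10, a, b)

print(y_at_zero, change_per_unit)
```

    The two printed values illustrate the interpretation given above: the first is the intercept, the second is the slope.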

    Least squares estimation

    The line is chosen by the method of ordinary least squares, which finds the slope and intercept that minimise the sum of the squared vertical differences between the observed points and the fitted line. These differences (observed value minus fitted value) are called residuals. Squaring them, as in the calculation of variance, prevents positive and negative residuals from cancelling and penalises large errors more heavily. The result is the best-fitting line in the least squares sense.
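
    For a single predictor, the least squares slope and intercept have a standard closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations in x, and the intercept makes the line pass through the point of means. The sketch below assumes a small made-up data set; the function names are hypothetical.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-products and sum of squared deviations in x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Return the quantity that least squares minimises."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_ols(xs, ys)
best = sum_squared_residuals(xs, ys, intercept, slope)

# Any other line gives a larger sum of squared residuals
nudged = sum_squared_residuals(xs, ys, intercept, slope + 0.1)

print(intercept, slope, best < nudged)
```

    Comparing `best` with `nudged` shows the defining property: moving the slope away from the fitted value increases the sum of squared residuals.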

    Interpreting R-squared

    The coefficient of determination, R², measures the proportion of variance in the outcome that is explained by the model. For a least squares model with an intercept, evaluated on the data used to fit it, R² ranges from 0 to 1: an R² of 0 means the predictors explain none of the variation, while an R² of 1 means they explain all of it. An R² of 0.64, for example, indicates that 64% of the variation in the outcome is accounted for by the model. A high R² does not confirm that a model is correct, however; it should be read alongside residual plots and an assessment of the model’s assumptions.
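
    R² can be computed as one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below uses made-up numbers and a hypothetical function name to show the two extremes described above.

```python
def r_squared(observed, predicted):
    """Return 1 - SS_res / SS_tot for paired observed and predicted values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# Made-up outcome values
observed = [1.0, 2.0, 3.0, 4.0]

# Predictions that match every observation: all variation explained
perfect = [1.0, 2.0, 3.0, 4.0]

# Predicting the mean for every case: no variation explained
mean_only = [2.5, 2.5, 2.5, 2.5]

print(r_squared(observed, perfect), r_squared(observed, mean_only))
```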

    Multiple regression

    Multiple regression extends the model to include several predictors at once, taking the form y = a + b₁x₁ + b₂x₂ + … + bₖxₖ. Each slope coefficient estimates the average change in y associated with a one-unit increase in its predictor while the others are held constant, which helps to adjust for measured confounding variables. This makes multiple regression valuable when several factors plausibly influence an outcome.
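
    One way to estimate these coefficients is to solve the normal equations (XᵀX)β = Xᵀy, where X is the predictor matrix with a leading column of ones for the intercept. The sketch below does this in plain Python for illustration only; in practice a statistics library would normally be used. The function names and the data are hypothetical, and the outcome is generated exactly from y = 1 + 2x₁ − 3x₂ so the fit can be checked.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are not modified
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_multiple_ols(rows, ys):
    """Return [a, b1, ..., bk]; fails with a zero pivot if predictors are collinear."""
    design = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical predictors (x1, x2) and an outcome built from known coefficients
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

coefficients = fit_multiple_ols(rows, ys)
print(coefficients)
```

    Because the outcome here contains no noise, the recovered coefficients should match the generating values up to floating-point error.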

    Assumptions of linear regression

    Assumption              Meaning
    Linearity               The relationship between predictor and outcome is linear
    Independence            Residuals are independent of one another
    Homoscedasticity        Residual variance is constant across the range of predictions
    Normality of residuals  Residuals are approximately normally distributed

    When these assumptions are violated, estimates and p-values can be misleading. Diagnostic plots help to detect problems before results are reported.
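
    As a numerical stand-in for a residual-versus-fitted plot, the sketch below fits a straight line to deliberately curved, made-up data (y = x²) and lists the residuals in order of x. The function name is hypothetical. A systematic pattern in the signs, rather than a random scatter around zero, suggests the linearity assumption is violated.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


# Made-up data with a curved (quadratic) relationship
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x ** 2 for x in xs]

intercept, slope = fit_simple_ols(xs, ys)
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# Residuals are positive at both ends and negative in the middle:
# a U-shaped pattern that a straight line cannot capture
signs = ["+" if r > 0 else "-" for r in residuals]
print(residuals, signs)
```

    Plotting `residuals` against `fitted` with any plotting tool would show the same U shape graphically.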

    Correlation is not causation

    A statistically significant slope shows that two variables are associated, not that one causes the other. Unmeasured confounders, reverse causation or coincidence can all produce an association without a direct causal link. Causal claims require careful study design, such as randomised experiments, not regression alone. Stating this limitation clearly is part of transparent, reproducible reporting, as encouraged by the CASRAI dictionary and our author guidance.
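
    The simulation below is a small sketch of how a confounder can create a slope between two variables that do not influence each other. All names and parameters are hypothetical: a variable `z` drives both `x` and `y`, and `x` never appears in the formula for `y`.

```python
import random


def ols_slope(xs, ys):
    """Return the least squares slope of ys regressed on xs."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


random.seed(42)  # fixed seed so the run is repeatable
n = 2000

# z is the confounder; x and y each depend on z plus independent noise
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

# x has no influence on y, yet the fitted slope is clearly positive.
# Under these settings cov(x, y) = 1 and var(x) = 1.25, so the slope
# tends towards 0.8 in large samples.
slope_y_on_x = ols_slope(x, y)
print(slope_y_on_x)
```

    A regression of y on x alone would report a clear association here, even though intervening on x would leave y unchanged.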

    Frequently asked questions

    What is the difference between correlation and regression?

    Correlation measures the strength and direction of a linear association with a single number between −1 and 1, and it does not depend on the units of measurement. Regression goes further, producing an equation that predicts the outcome and estimates, in the outcome's own units, the change associated with each predictor.
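
    The contrast can be seen by rescaling the outcome. In the sketch below (made-up data, hypothetical function names), multiplying y by 10 leaves the correlation unchanged but multiplies the regression slope by 10, because the slope is expressed in units of y per unit of x.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ols_slope(xs, ys):
    """Return the least squares slope of ys regressed on xs."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


# Made-up illustrative data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
ys_rescaled = [10 * y for y in ys]  # same outcome in different units

r = pearson_r(xs, ys)
r_rescaled = pearson_r(xs, ys_rescaled)
slope = ols_slope(xs, ys)
slope_rescaled = ols_slope(xs, ys_rescaled)

print(r, r_rescaled, slope, slope_rescaled)
```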

    What counts as a good R-squared value?

    It depends entirely on the field. In physical sciences an R² above 0.9 may be expected, whereas in social or biological research values of 0.2 to 0.4 can still be meaningful. Always interpret R² in context.

    Can regression prove causation?

    No. Regression quantifies association and can adjust for measured confounders, but it cannot establish causation on its own. Causal inference requires appropriate design, such as randomisation or robust quasi-experimental methods.