Quantitative data analysis involves using statistical techniques to summarise data and test hypotheses. This page provides an overview of commonly used analysis methods, which have been grouped according to their main purpose. While the list is not comprehensive, each section provides an introduction to the most commonly used methods and their typical applications. Links are also provided to other modules and to external resources which explain the theory behind each method and, in some cases, how to perform analyses in SPSS or Stata.
In addition, you may wish to view one or both of the following resources, which provide guidance on selecting the most appropriate statistical test for your research question and data type:
Descriptive statistics are used to summarise and describe the data in your sample. Common examples include measures of central tendency and measures of dispersion to describe the distribution of single continuous variables, and frequencies and percentages to describe single categorical variables. Graphs, such as histograms, box plots, and bar charts, are also commonly used to visualise distributions.
Descriptive statistics are typically the first step before more complex analyses, as they help you to understand patterns in your data. Simple summaries of relationships, such as correlation between two continuous variables, can also be included in descriptive analyses to explore how variables are linked, although formal interpretation usually involves inferential methods to assess relationships.
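Although the module demonstrates these calculations in SPSS and Stata, the underlying ideas are simple enough to sketch with Python's standard library. The following is a hypothetical illustration with invented data, not part of the module:

```python
import statistics
from collections import Counter

# Hypothetical example data (invented for illustration)
ages = [23, 25, 31, 31, 40, 52]            # a continuous variable
smoking = ["never", "former", "never",
           "current", "never", "former"]   # a categorical variable

# Measures of central tendency and dispersion for a continuous variable
print(f"mean   = {statistics.mean(ages):.2f}")
print(f"median = {statistics.median(ages):.2f}")
print(f"sd     = {statistics.stdev(ages):.2f}")  # sample standard deviation

# Frequencies and percentages for a categorical variable
counts = Counter(smoking)
for category, n in counts.items():
    print(f"{category}: n = {n} ({100 * n / len(smoking):.1f}%)")
```

A statistical package would typically report these summaries, along with the minimum, maximum, and quartiles, from a single command.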
If you would like to learn how to calculate descriptive statistics using SPSS or Stata software, you may like to refer to the Descriptive statistics page of the Introduction to SPSS or Introduction to Stata module respectively.
Comparing means involves assessing whether the mean of a continuous dependent (outcome) variable differs between two or more groups (or at two or more time points). The methods used for this purpose, such as t-tests and ANOVA, are parametric, meaning they rely on certain assumptions about the distribution of the data. When these assumptions are not met, non-parametric tests can be used; these methods do not compare means directly, but instead compare the relative ranking of values.
Common methods of comparing means include:
t-tests: used to compare one or two means against each other or against a reference value. A one-sample t-test compares the mean of a single group to a known or hypothesised value, an independent samples t-test compares the means of two separate groups, and a paired samples t-test compares measurements taken from the same individuals at two time points.
Analysis of Variance (ANOVA): an extension of the independent samples t-test that compares means across three or more groups. ANOVA can involve one independent variable (one-way ANOVA) or multiple independent variables (for example, two-way ANOVA). While ANOVA can tell you whether any group differs from the others, post hoc tests are required to determine which specific groups differ.
Repeated measures ANOVA: a type of ANOVA used when the same individuals are measured on a continuous outcome at two or more time points or under two or more conditions. It can also involve one independent variable (one-way repeated measures ANOVA) or multiple independent variables (for example, two-way repeated measures ANOVA).
ANCOVA: similar to ANOVA, but compares adjusted means by controlling for the effects of additional continuous variables (covariates) that might influence the outcome. As with ANOVA, ANCOVA can involve one independent variable (one-way ANCOVA) or multiple independent variables (for example, two-way ANCOVA).
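To make the mechanics concrete, the independent samples t-test can be sketched in plain Python. This is a hypothetical illustration with invented data, using the pooled (equal-variance) formula; in practice the whole test, including the p-value, would be run in SPSS or Stata:

```python
import math
import statistics

# Hypothetical example: a continuous outcome in two independent groups
group_a = [5.1, 4.8, 6.0, 5.5, 5.9]
group_b = [4.2, 4.9, 4.4, 5.0, 4.1]

n1, n2 = len(group_a), len(group_b)
mean1, mean2 = statistics.mean(group_a), statistics.mean(group_b)
var1, var2 = statistics.variance(group_a), statistics.variance(group_b)

# The pooled variance assumes equal variances in the two groups
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t = (mean1 - mean2) / se
df = n1 + n2 - 2
print(f"t = {t:.3f} on {df} degrees of freedom")
# The p-value would then be read from the t distribution with df degrees
# of freedom; statistical software does this step automatically.
```

One-way ANOVA generalises the same idea, comparing variation between group means to variation within groups via an F statistic.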
If you would like to learn how to conduct t-tests and one-way ANOVA using SPSS or Stata software, you may like to refer to the Comparing means page of the Introduction to SPSS or Introduction to Stata module respectively.
In addition, you may like to view this online ANOVA resource which provides further explanations and examples of different types of ANOVA.
Assessing relationships involves examining whether and how variables are related to one another. Depending on the type of variables involved (categorical or continuous) and the research question, different statistical methods are used to test for relationships, quantify their strength, and, where appropriate, adjust for other variables.
Common methods for assessing relationships include:
Chi-square test: used to assess whether there is an association between two categorical variables. Variants include the chi-square test of independence (for association between two categorical variables) and the chi-square goodness-of-fit test (for comparing observed frequencies to expected frequencies based on a hypothesised distribution).
Correlation: used to quantify the strength and direction of the relationship between two continuous variables. The most common measure is Pearson’s correlation coefficient, a parametric statistic which assesses linear relationships, while Spearman’s rank correlation coefficient is a non-parametric alternative based on ranks. Correlation coefficients range from –1 to +1: the sign indicates the direction of the relationship, and the absolute value its strength.
Regression: used to model the relationship between a dependent (outcome) variable and one or more independent (predictor) variables. Regression allows estimation of the magnitude and direction of effect sizes, and can adjust for potential confounders. Different types of regression are used depending on the type of dependent variable and the data structure, as detailed below:
Linear regression is used when the dependent variable is continuous. Simple linear regression includes one independent variable, while multiple linear regression includes two or more independent variables. Linear regression estimates the average change in the dependent variable for each unit change in the independent variable(s), assuming a linear relationship.
Logistic regression is used when the dependent variable is binary (two categories). As with linear regression, models may include one independent variable (simple logistic regression) or two or more independent variables (multiple logistic regression). Logistic regression estimates the change in the odds of the outcome for each unit change in the independent variable(s), assuming a linear relationship between the independent variables and the log odds of the outcome.
Mixed-effects regression (also called multilevel or hierarchical regression) is used when data are clustered or involve repeated measurements (for example, patients grouped within hospitals or people with multiple measurements over time). Mixed models can be used with continuous outcomes (linear mixed models) or binary outcomes (generalised linear mixed models), and estimate the relationship between the independent and dependent variables while accounting for the structure of the data.

If you would like to learn how to conduct a chi-square test of independence or determine Pearson’s correlation coefficient using SPSS or Stata software, you may like to refer to the Assessing relationships page of the Introduction to SPSS or Introduction to Stata module respectively.
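As an illustration of the quantities these methods produce, Pearson’s correlation coefficient and a simple linear regression can be computed by hand. The following hypothetical Python sketch uses invented data; statistical software would report the same estimates together with p-values and confidence intervals:

```python
import statistics

# Hypothetical paired observations on two continuous variables
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # independent (predictor) variable
y = [2.1, 3.9, 6.2, 8.0, 9.8]   # dependent (outcome) variable

n = len(x)
mean_x, mean_y = statistics.mean(x), statistics.mean(y)

# Pearson's correlation: covariance scaled by the two standard deviations
cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
r = cov / (statistics.stdev(x) * statistics.stdev(y))

# Simple linear regression: least-squares slope and intercept
slope = cov / statistics.variance(x)
intercept = mean_y - slope * mean_x

print(f"r = {r:.3f}")
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

The slope is interpreted as the average change in y for each one-unit increase in x, matching the description of linear regression above.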
In addition, you may like to view this online regression resource which provides further explanations and examples of different types of regression.
Survival and longitudinal analyses are used when an outcome is measured over time, either until an event occurs (survival) or at multiple time points (longitudinal). These methods account for the timing and correlation of repeated measurements, which standard regression approaches may not handle appropriately.
Common approaches include:
Survival analysis: used when the outcome is time-to-event (for example, time to disease recurrence). Survival methods account for censoring, which occurs when the event has not yet happened for some participants. The most common techniques are the Kaplan–Meier method for estimating survival curves and Cox proportional hazards regression for modelling the effect of predictors on time-to-event outcomes.
Linear mixed models: used for continuous outcomes measured repeatedly over time. These models account for the correlation of repeated measurements within individuals and can include multiple predictors.
Generalised estimating equations (GEE): used for correlated outcomes, including repeated measurements or clustered data, which can be continuous, binary, or count. GEE estimates population-average relationships rather than individual-level effects.
Growth curve modelling: used to analyse change over time for individuals or groups, often using repeated-measures data.
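The Kaplan–Meier product-limit estimator mentioned above is straightforward to sketch by hand. The following hypothetical Python example, with invented time-to-event data, shows how censored observations simply leave the risk set rather than pulling the survival curve down:

```python
# Hypothetical time-to-event data: (time, event) pairs, where
# event = 1 means the event occurred and event = 0 means censored
data = [(2, 1), (3, 1), (4, 0), (5, 1), (5, 1), (8, 0), (9, 1), (12, 0)]

def kaplan_meier(data):
    """Product-limit estimate of the survival function S(t)."""
    survival = 1.0
    curve = []
    at_risk = len(data)
    # Walk through the distinct event/censoring times in order
    for t in sorted({time for time, _ in data}):
        deaths = sum(1 for time, event in data if time == t and event == 1)
        censored = sum(1 for time, event in data if time == t and event == 0)
        if deaths > 0:
            survival *= 1 - deaths / at_risk   # curve steps down at event times
            curve.append((t, survival))
        at_risk -= deaths + censored           # censored subjects leave the risk set
    return curve

for t, s in kaplan_meier(data):
    print(f"S({t}) = {s:.3f}")
```

In practice the curve, its confidence interval, and group comparisons (such as the log-rank test) would be produced by SPSS, Stata, or a dedicated survival analysis package.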
Measures of effect quantify the size and direction of a relationship, providing information about its practical significance. They are commonly used in clinical and epidemiological research to summarise the strength of relationships. Different measures of effect are used for different statistical tests and study designs.
Some common measures of effect include:
Relative risk (risk ratio): used in cohort studies to compare the probability of an event occurring in an exposed group with the probability in an unexposed group. A relative risk of 1 indicates no difference between groups, greater than 1 indicates increased risk, and less than 1 indicates decreased risk in the exposed group.
Odds ratio: commonly used in case-control studies or logistic regression to compare the odds of an event between groups. An odds ratio of 1 indicates no association, greater than 1 indicates higher odds, and less than 1 indicates lower odds of the outcome in the exposed group.
Hazard ratio: used in survival analysis to compare the rate at which events occur in different groups over time. A hazard ratio of 1 indicates no difference in the hazard between groups, greater than 1 indicates a higher hazard, and less than 1 indicates a lower hazard in the exposed group.
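The relative risk and odds ratio are simple ratios, so a small worked example helps. The following hypothetical Python sketch computes both from an invented 2×2 table in which 20 of 100 exposed and 10 of 100 unexposed participants experienced the event:

```python
# Hypothetical 2x2 table: rows = exposed/unexposed, columns = event/no event
a, b = 20, 80    # exposed group: events, non-events
c, d = 10, 90    # unexposed group: events, non-events

risk_exposed = a / (a + b)      # probability of the event if exposed
risk_unexposed = c / (c + d)    # probability of the event if unexposed

relative_risk = risk_exposed / risk_unexposed
odds_ratio = (a * d) / (b * c)  # odds of the event: exposed vs unexposed

print(f"RR = {relative_risk:.2f}")   # 2.00: risk is doubled in the exposed group
print(f"OR = {odds_ratio:.2f}")      # 2.25: close to, but not equal to, the RR
```

Note that the odds ratio only approximates the relative risk when the event is rare; hazard ratios additionally require the timing of events, so they cannot be computed from a simple 2×2 table.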
For more information about statistical significance, and the difference between this and practical significance, you may like to view the short video Statistical vs. Practical Significance.
Design-specific approaches are methods that are used in particular study designs to ensure the analysis appropriately reflects the way the study was conducted. These approaches help maintain the validity of results and account for features of the study design.
An example is as follows: