• To learn the rationale for complementing or replacing a P-value with a Bayesian inference
• To understand the basic concepts and compelling statistical and practical benefits of Bayesian analysis
• To learn about the two Bayesian approaches increasingly advocated as alternatives or complements to commonly applied P-value statistics
• To identify user-friendly software available for Bayesian analysis
Background The statistical shortcomings of null hypothesis significance testing (NHST) are well documented, yet it continues to be the default paradigm in quantitative healthcare research. This is due partly to unfamiliarity with Bayesian statistics.
Aim To highlight some of the theoretical and practical benefits of using Bayesian analysis.
Discussion A growing body of literature demonstrates that Bayesian analysis offers statistical and practical benefits that are unavailable to researchers who rely solely on NHST. Bayesian analysis uses prior information in the inference process. It tests a hypothesis and yields the probability of that hypothesis, conditional on the observed data; in contrast, NHST checks observed data – and more extreme unobserved data – against a hypothesis and yields the long-term probability of the data based on repeated hypothetical experiments. Bayesian analysis quantifies the evidence for the null and alternative hypotheses, whereas NHST cannot provide evidence for the null hypothesis. Bayesian analysis allows multiple testing without corrections, whereas NHST requires corrections for multiplicity. Finally, Bayesian analysis allows sequential data collection with variable stopping, whereas NHST sequential designs require specialised statistical approaches.
Conclusion The Bayesian approach provides statistical, practical and ethical advantages over NHST.
Implications for practice The quantification of uncertainty provided by Bayesian analysis – particularly Bayesian parameter estimation – should better inform evidence-based clinical decision-making. Bayesian analysis gives researchers the freedom to analyse data in real time, stopping optimally when the data are persuasive and continuing when they are weak, thereby ensuring better use of the researcher’s time and resources.
Nurse Researcher. doi: 10.7748/nr.2023.e1852
Peer review This article has been subject to external double-blind peer review and checked for plagiarism using automated software
Conflict of interest None declared
Malone HE, Coyne I (2023) Bayesian analysis for nurse and midwifery research: statistical, practical and ethical benefits. Nurse Researcher. doi: 10.7748/nr.2023.e1852
Published online: 19 January 2023
A fundamental aspect of evidence-based practice is that nurses and midwives should use the best available evidence (Cleary-Holdforth et al 2021) from qualitative and quantitative research (Malone and Coyne 2017, Hosseini et al 2021). There are various approaches to analysing quantitative evidence, but null hypothesis significance testing (NHST) has dominated statistical inference in medicine (Kelter 2020) and nursing and midwifery research (Malone and Coyne 2017) for several decades (Kruschke 2011).
Unfortunately, NHST has well documented statistical deficiencies (Edwards et al 1963, Cohen 1994, Wagenmakers 2007, Dienes 2011, Kruschke 2011). But an alternative approach, Bayesian analysis, can solve many of these problems (Kruschke 2011).
Quantitative research findings can potentially be translated into clinical practice. Nurse researchers should therefore consider whether it is better to use the NHST paradigm or the Bayesian paradigm to inform evidence-based clinical decision-making. This article is intended to provide compelling reasons why nurses and midwives should make use of the statistical and practical advantages of the Bayesian paradigm.
• Bayesian analysis provides a solution to the well-documented shortcomings of NHST
• Bayesian inference makes available statistical and practical benefits that are unavailable to researchers who rely solely on NHST
• Bayesian analysis has ethical benefits and supports better use of resources, as it allows data collection to stop optimally when the evidence is compelling or to continue when the evidence is weak
• Bayesian analysis provides the researcher with the probability of the hypothesis rather than the inverse probability given by NHST
Several terms are used to denote NHST, including ‘frequentist statistics’ (Marsman and Wagenmakers 2017), ‘classical statistics’ (Wagenmakers et al 2018a) and ‘error-based statistics’ (Goodman 1999a). Researchers who use NHST are trained to frame their research questions by asking whether null values can be rejected. NHST calculates a test statistic, such as z, t, F or χ² (Abebe 2019), which is then converted to a P-value, which is used to make a binary decision as to whether the research result is ‘statistically significant or not’ (Burnham and Anderson 2014). Frequentist researchers make inferences based on the long-term expected performance of these test statistics for a hypothetical infinite repetition of experiments (Haucke et al 2021).
NHST is a hybrid (Garberich and Stanberry 2018) of Fisher’s P-value (Fisher 1928) and Neyman and Pearson’s long-term error rates α and β (Neyman and Pearson 1933), where α corresponds to false positives (type I errors) and β to false negatives (type II errors). Researchers set these long-term error rates at ‘acceptable’ levels (Dienes 2011) – α is typically set at 5% (sometimes 1%) and β is typically set at 20%, which corresponds to a statistical power of 80% (Goodman 1999a). If the data’s P-value is less than an arbitrary cut-off point – for example, 5% – the result is deemed ‘significant’ and the null hypothesis is rejected. The use of a threshold P-value as a basis for rejecting the null hypothesis is called a ‘significance test’ (Goodman 1993). It has been recommended in recent years that researchers report the calculated P-value, rather than simply stating whether the P-value is less than or greater than the cut-off point (Wasserstein et al 2019).
The ‘new statistics’ advocates shifting from significance testing to the consideration of effect sizes, confidence intervals and meta-analysis (Cumming 2014). P-values should therefore be interpreted in the context of a meaningful effect size of practical and/or clinical importance (Malone et al 2016, Betensky 2019). However, this has some limitations:
• The NHST confidence interval for the effect size is also based on long-term hypothetical repeated measurements (Hoekstra et al 2014).
• Owing to ‘publication bias’, the average effect size in a meta-analysis is inflated when unpublished studies are excluded from the averaging (Cumming 2014).
Bayes’ theorem is the basis of Bayesian analysis. It was formulated by Thomas Bayes around 1763 and pioneered by Harold Jeffreys for statistical applications (Edwards et al 1963).
Bayesian analysis takes ‘prior information’ from previous studies or beliefs and uses Bayes’ rule and the ‘likelihood function’ (Kruschke 2011) to update ‘the prior’ with newly obtained data (Quintana et al 2017).
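In symbols, with θ denoting the parameter of interest (for example, an effect size) and D the newly obtained data, Bayes’ rule is:

    P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}, \qquad \text{that is,} \quad \text{posterior} \propto \text{likelihood} \times \text{prior}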
The three constituent parts of Bayesian analysis are (Halter 2018):
1. The ‘prior distribution’ – the researcher’s belief about the credible effect sizes before observing the new data.
2. The ‘likelihood’ – the extent to which the data updates the prior distribution.
3. The ‘posterior distribution’ – an updated estimate of the distribution of credible effect sizes. The posterior distribution can be summarised by reporting a median and credibility interval.
Null values can be assessed using one or both of two methods underpinned by Bayes’ rule (Kruschke 2011): Bayesian model comparison (BMC) and Bayesian parameter estimation (BPE). Bayesian analysis uses the term ‘model’ interchangeably with ‘hypothesis’ and ‘theory’. Rouder et al (2018) provide an overview of these two approaches.
Although they are distinctly different paradigms, BMC is the nearest Bayesian counterpart to NHST. BMC yields the relative credibility of a null hypothesis versus an alternative hypothesis (Kruschke 2011). BMC gives a ‘Bayes factor’, which is formally defined as the ratio between the marginal likelihoods of two models, for example, the null model and the alternative model (Quintana and Williams 2018). The Bayes factor quantifies on a continuous scale the relative evidence provided by the data for the two competing hypotheses, with the result presented as a Bayes factor ratio. The evidence given by the Bayes factor can be classified as ‘inconclusive’, ‘anecdotal’, ‘moderate’, ‘strong’, ‘very strong’ and ‘extreme’ using a Bayes classification table (Quintana and Williams 2018).
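In symbols, the Bayes factor comparing an alternative model H1 with a null model H0 is the ratio of the two marginal likelihoods, each obtained by averaging that model’s likelihood over its prior:

    \mathrm{BF}_{10} = \frac{p(D \mid H_1)}{p(D \mid H_0)} = \frac{\int p(D \mid \theta, H_1)\, p(\theta \mid H_1)\, d\theta}{\int p(D \mid \theta, H_0)\, p(\theta \mid H_0)\, d\theta}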
BMC typically compares a simple null hypothesis with a ‘composite’ of hypotheses – many simple hypotheses (Goodman 1999b). The Bayes factor in its simplest form – comparing two simple hypotheses – is the frequentist likelihood ratio (Goodman 1999b).
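A minimal sketch of this simplest case – two hypothetical point hypotheses for a normal mean with known standard deviation, all values being illustrative assumptions – shows the Bayes factor reducing to a likelihood ratio:

    import numpy as np
    from scipy.stats import norm

    # Hypothetical data: 20 observations with known standard deviation 1
    rng = np.random.default_rng(42)
    data = rng.normal(0.4, 1.0, size=20)

    # Two simple (point) hypotheses for the mean
    h0_mean, h1_mean = 0.0, 0.5

    # Log-likelihood of the data under each point hypothesis
    loglik_h0 = norm.logpdf(data, loc=h0_mean, scale=1.0).sum()
    loglik_h1 = norm.logpdf(data, loc=h1_mean, scale=1.0).sum()

    # With two simple hypotheses, the Bayes factor is the likelihood ratio
    bf_10 = np.exp(loglik_h1 - loglik_h0)
    print(f"BF10 (evidence for H1 over H0): {bf_10:.2f}")

Because both hypotheses here are simple, no averaging over a prior is needed; with composite hypotheses, each likelihood would be averaged over that model’s prior, as in the formula above.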
Decisions about the null are made using the Bayes factor. The median and the credibility interval for the effect size can be reported alongside the Bayes factor.
BPE involves only one model. It yields a posterior distribution showing the relative credibility of all candidate parameter values, including the null value, under a single distribution; it also provides an estimate of the magnitude and uncertainty of the credible effect sizes.
The ‘highest density interval’ (HDI) of the posterior distribution is the range of parameter values that all have higher credibility than values outside the range. A 95% HDI is a probability range that includes 95% of the posterior distribution.
The HDI is examined to assess whether it includes or excludes the null value. This assessment is helped by placing a small interval of variability around the null value, called the ‘region of practical equivalence’ (ROPE). The ROPE is a ‘decision threshold’ – a range of values deemed practically equivalent to the null value (Kruschke 2011). Its limits are set, based on the researcher’s experience of the research domain, to include the values around the null that they are prepared to accept as of no practical importance to the research question.
Once the ROPE is established, the decision rules are (Kruschke 2011):
1. If the HDI – which by convention is set at 95% – falls entirely inside the ROPE, accept the null hypothesis.
2. If the HDI falls entirely outside the ROPE, reject the null hypothesis.
3. If the HDI and ROPE overlap, the result is inconclusive.
Decisions about the null are assessed by comparing the HDI to the ROPE. The median and a 95% HDI for the effect size are typically reported. However, the discerning reader has the option to set their own ROPE for assessing published results.
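The following sketch illustrates these decision rules on hypothetical posterior samples; the sample values, 95% level and ROPE limits are illustrative assumptions, not prescriptions:

    import numpy as np

    def hdi(samples, cred_mass=0.95):
        """Narrowest interval containing cred_mass of the posterior samples."""
        sorted_s = np.sort(samples)
        n = len(sorted_s)
        interval_size = int(np.ceil(cred_mass * n))
        widths = sorted_s[interval_size:] - sorted_s[:n - interval_size]
        start = np.argmin(widths)          # narrowest candidate interval
        return sorted_s[start], sorted_s[start + interval_size]

    def rope_decision(hdi_low, hdi_high, rope=(-0.1, 0.1)):
        """Kruschke's rules: HDI inside ROPE -> accept null;
        HDI outside ROPE -> reject null; overlap -> inconclusive."""
        if rope[0] <= hdi_low and hdi_high <= rope[1]:
            return "accept null"
        if hdi_high < rope[0] or hdi_low > rope[1]:
            return "reject null"
        return "inconclusive"

    # Hypothetical posterior samples for an effect size (illustration only)
    rng = np.random.default_rng(1)
    posterior = rng.normal(0.3, 0.1, size=10_000)
    low, high = hdi(posterior)
    print(f"95% HDI: [{low:.2f}, {high:.2f}] -> {rope_decision(low, high)}")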
Kruschke (2011) recommended that researchers wishing to use a Bayesian approach to analyse data should choose whichever of BMC and BPE best answers their research question. BMC is typically used to test whether an effect is present while BPE is used to assess the size of the effect under the assumption that it is present (van Doorn et al 2021). Hypothesis testing and estimation can also be applied in sequence – for example, researchers can first use BMC to ascertain if an effect is present; if it is, they can then use BPE to assess the size of the effect and report it as a credibility interval (van Doorn et al 2021).
BMC may also be preferable in specific situations, such as when the researcher’s main interest is in finding evidence favouring the null hypothesis: its Bayes factor can provide stronger evidence in favour of the null model from smaller data sets than BPE can (Kruschke 2011).
However, BMC can generate Bayes factors that are sensitive to the choice of prior (Kruschke 2011). For example, too vague a prior can result in acceptance of the null model (Tendeiro and Kiers 2019). ‘Informed priors’ are therefore advocated (Vanpaemel and Lee 2012); the prior can be informed by data from previous similar studies, expert opinion or the researcher’s beliefs (Quintana et al 2017). BPE is more robust (Kruschke 2011) and less sensitive to different choices of prior (Tendeiro and Kiers 2019). Researchers can make use of the default priors available in the software packages SPSS and JASP. A Bayesian result is considered robust if the conclusions do not change across a range of different priors; this can be assessed using the robustness plots in JASP (van Doorn et al 2021).
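A minimal sketch of this prior sensitivity, assuming a simple conjugate normal model with illustrative numbers: as the prior placed on the alternative’s effect is made increasingly vague, the Bayes factor BF01 increasingly favours the null model even though the data are unchanged:

    import numpy as np
    from scipy.stats import norm

    # Sketch: data ~ N(theta, 1); H0: theta = 0; H1: theta ~ N(0, tau^2).
    # The sample mean is sufficient here, so the Bayes factor is a ratio
    # of two normal densities evaluated at the observed mean.
    n, ybar, sigma = 50, 0.25, 1.0                 # hypothetical study results
    se = sigma / np.sqrt(n)

    for tau in [0.1, 0.5, 1.0, 10.0, 100.0]:       # increasingly vague priors
        m0 = norm.pdf(ybar, 0, se)                        # marginal under H0
        m1 = norm.pdf(ybar, 0, np.sqrt(se**2 + tau**2))   # marginal under H1
        print(f"prior sd tau = {tau:6.1f}  ->  BF01 = {m0 / m1:8.2f}")

Running the sketch shows BF01 rising steadily as tau grows, reproducing the pattern Tendeiro and Kiers (2019) warn about.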
Box 1 provides resources offering further advice about Bayesian approaches, including which tests to use and how to apply them in available software.
• Goss-Sampson et al (2020): Practical steps for Bayesian inference and a Bayes factor classification table
• JASP Team (2018): Helpful worked examples of NHST tests and their Bayesian counterparts, with hypothesis test results and credibility intervals for effect sizes presented graphically
• Kruschke (2011): Bayesian parameter estimation using the ROPE plus HDI
• Kruschke (2021): Comprehensive guidelines for Bayesian reporting
• van Doorn et al (2021): Advice on appropriate methods applied in JASP for the credibility interval used in directional testing
• Wagenmakers et al (2018b): Worked examples for Bayesian independent t-test, correlation and ANOVA
Most research questions are framed as a hypothesis, with researchers wanting to know if their hypothesis is correct (Cohen 1994). Bayesian analysis focuses on the hypothesis and provides the ‘probability of the hypothesis, given the data’ – formally denoted P(Hypothesis | Data). This directly answers the researcher’s question (Dienes 2011).
In contrast, NHST does not directly tell researchers if their hypothesis is correct (Cohen 1994). Instead, it focuses on the data and gives researchers the ‘probability of the data, given the hypothesis’ (Dienes 2011) – P(Data | Hypothesis). This inverse probability is interpreted as a long-run frequency over hypothetical repetitions of the experiment.
Bayesian inference depends on observed data; NHST depends on observed data and data obtained from a hypothetical infinite repetition of the experiment (Wagenmakers 2007, van de Schoot et al 2014).
NHST’s P-value is the probability of the experimental data plus more extreme data, given that the null hypothesis is true (Goodman 1993, Wagenmakers et al 2018b). Therefore, it is a tail-area integral – the area under the null hypothesis sampling curve – that includes the test statistic for the observed data plus hypothetical unobserved data (Wagenmakers 2007). This means that the observed result is grouped with other outcomes that might have occurred in hypothetical repetitions of the experiment (Goodman 1999a) – thus the P-value depends on fictitious data that are more extreme than the observed data (Kruschke and Liddell 2018a).
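For a one-sided test with observed test statistic t_obs, this tail-area definition can be written as:

    p = P(T \geq t_{\mathrm{obs}} \mid H_0) = \int_{t_{\mathrm{obs}}}^{\infty} f(t \mid H_0)\, dt

where f(t | H0) is the sampling distribution of the test statistic under the null hypothesis.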
The error-based, NHST paradigm (Goodman 1999a) controls error rates at acceptable levels (Dienes 2011) that apply to long-run, repeated experiments, not to individual experiments (Wagenmakers et al 2008). Its confidence interval for the effect-size is also based on long-term, hypothetical, repeated measurements (Hoekstra et al 2014, Wagenmakers et al 2018b).
Unlike NHST, Bayesian analysis incorporates prior information or beliefs into the inference process (Wagenmakers et al 2018b). This can mitigate ‘false alarms’ – also referred to as false discoveries or false positives – because the information in the prior moderates extreme claims in the analysis (Kruschke 2011).
BMC evaluates which of two competing hypotheses better predicts the observed data (Wagenmakers et al 2018b). It quantifies the relative evidence for the two competing hypotheses on a continuous scale and presents the result as a Bayes factor ratio (Quintana and Williams 2018).
Bayes factors can also provide three states of evidence (Dienes and Mclatchie 2018), as the sketch after this list illustrates:
1. Evidence for the null hypothesis rather than the alternative hypothesis.
2. Evidence for the alternative hypothesis rather than the null hypothesis.
3. Not much evidence either way.
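A small sketch of this three-way reading, using the classification thresholds cited above (anecdotal to 3, moderate to 10, strong to 30, very strong to 100, extreme beyond 100); the thresholds are reporting conventions, not part of Bayes’ rule itself:

    def classify_bf(bf10):
        """Label a Bayes factor BF10; values below 1 favour the null."""
        direction = "for H1" if bf10 >= 1 else "for H0"
        strength = bf10 if bf10 >= 1 else 1 / bf10
        for limit, label in [(3, "anecdotal"), (10, "moderate"),
                             (30, "strong"), (100, "very strong")]:
            if strength < limit:
                return f"{label} evidence {direction}"
        return f"extreme evidence {direction}"

    for bf in [0.05, 0.5, 2.0, 15.0]:
        print(bf, "->", classify_bf(bf))

Here 0.05 yields strong evidence for the null, 15 yields strong evidence for the alternative, and values near 1 (such as 0.5 or 2.0) yield only anecdotal evidence – ‘not much evidence either way’.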
BPE’s posterior distribution reveals the relative credibility of all the candidate parameter values, including the null value. Comparisons can be directly viewed using the posterior distribution (Kruschke and Liddell 2018b), enabling researchers to evaluate the relative credibility of all the parameter values (Kruschke 2011).
In contrast, the P-value from NHST provides no measure of credibility in favour of the null hypothesis (Kruschke 2011), as it can only be used to reject the null hypothesis (Wagenmakers et al 2008). Therefore, NHST fails to quantify statistical evidence (Wagenmakers 2007), as it focuses on dichotomous decision-making (Cohen 1994) based on arbitrarily set cut-off points (Menyhart et al 2021). BMC and BPE both provide methods for accepting the null value (Kruschke 2011).
Rogue, unrepresentative data can cause some findings to be false, regardless of the paradigm applied (Ioannidis 2005). BMC and BPE have a moderate probability of falsely rejecting the null hypothesis when small sample sizes are used. Collecting more data and monitoring each datum as it is added reduces the probability of falsely rejecting the null. False alarms are also mitigated by the information in the prior (Kruschke 2011).
Comparisons of P-values with Bayes factors have shown that P-values overstate the evidence against the null hypothesis (Goodman 1999b, Wagenmakers 2007). NHST inference also depends on which hypothesis is labelled the null hypothesis. The criterion of P<.05 tolerates a 5% false-alarm rate in decisions to reject the null value (Kruschke and Liddell 2018a).
Bayesian analysis is more conservative than NHST inference. For example, an independent t-test result deemed ‘borderline significant’ at the 5% cut-off point can be shown by Bayesian analysis – via the minimum Bayes factor – to carry at least a 13% probability of ‘no effect’ (Goodman 1999a).
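The 13% figure can be reconstructed from the minimum Bayes factor for a normally distributed test statistic. At the two-sided 5% cut-off, z = 1.96, so:

    \mathrm{BF}_{\min} = e^{-z^2/2} = e^{-1.96^2/2} \approx 0.15, \qquad P(H_0 \mid D) \geq \frac{0.15}{1 + 0.15} \approx 0.13

where the posterior probability assumes prior odds of 1:1 between the null and alternative hypotheses.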
Sequential data designs and ‘optimal stopping’ have practical and ethical benefits in clinical studies.
Bayesian sequential data analysis allows the researcher to monitor the evidence one datum at a time as the data accumulates (Wagenmakers et al 2016). The result is coherent, regardless of whether the data were analysed sequentially, one at a time or as a single batch (Wagenmakers et al 2018b).
NHST does not permit sequential data analysis in predetermined, fixed-sample designs (Wagenmakers 2007), because the rate of false positives increases if testing occurs after every additional datum (Kruschke 2011).
Bayesian analysis enables researchers to apply ‘optimal stopping’ (Edwards et al 1963, Rouder 2014) and to have a variable number of participants (n) (Cumming 2014): researchers can stop collecting data when the evidence is compelling or continue if it is weak. This has ethical benefits because a trial can be stopped as soon as there is strong evidence for or against an effect, whereas in NHST’s fixed-n designs the sample size is stated in advance and fixed (Wagenmakers et al 2016).
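A minimal sketch of optimal stopping, reusing the conjugate normal model from the earlier sketch; the true effect, prior scale and stopping thresholds of 10 and 1/10 are illustrative assumptions:

    import numpy as np
    from scipy.stats import norm

    # H0: theta = 0 versus H1: theta ~ N(0, tau^2); data sd known = 1
    rng = np.random.default_rng(7)
    tau, sigma = 0.7, 1.0
    data = rng.normal(0.5, sigma, size=200)    # hypothetical incoming data

    for n in range(1, len(data) + 1):          # monitor one datum at a time
        ybar, se = data[:n].mean(), sigma / np.sqrt(n)
        bf10 = (norm.pdf(ybar, 0, np.sqrt(se**2 + tau**2))
                / norm.pdf(ybar, 0, se))
        if bf10 > 10 or bf10 < 1 / 10:         # pre-chosen evidence thresholds
            print(f"stop at n = {n}: BF10 = {bf10:.1f}")
            break
    else:
        print(f"evidence still weak after n = {len(data)}: BF10 = {bf10:.1f}")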
An NHST sequential design can have a variable n (Cumming 2014). However, researchers must then use specialised statistical approaches that aim to maintain the total overall α (Todd 2007).
Researchers simultaneously evaluating many hypotheses in NHST sub-group analysis must control for multiple comparisons with appropriate adjustments or potentially inflate rates of false positives (Menyhart et al 2021), especially in discovery-orientated research (Ioannidis 2005). However, Bayesian analysis can mitigate inflated rates of false positives more rationally, through the information in the prior, than can intention-based corrections for multiple comparisons in NHST (Kruschke 2011). Researchers using Bayesian inference do not need special planning for multiple-testing corrections or for planned versus post-hoc comparisons (Dienes 2011), because Bayesian inference quantifies the relative evidence in the observed data, whereas NHST’s P-values depend on the number and intention of the tests performed.
Bayesian analysis mitigates against false alarms in hierarchical – multilevel – models. For example, if a researcher is determining the effect size of a new intervention in a multi-hospital patient sample, they can add an extra, higher level to the Bayesian analysis to account for the differences between hospitals. Between-hospital effects that are extreme or unrepresentative are then diminished through a process called ‘shrinkage’ or ‘partial pooling’, which moves point estimates and their corresponding intervals closer to the overall group-level estimate, rendering them more conservative (Kruschke and Liddell 2018b).
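A simplified empirical-Bayes sketch of shrinkage, with illustrative numbers and the variances assumed known: each hospital’s raw mean is pulled towards the grand mean, and the extreme, small-sample hospital is pulled the most:

    import numpy as np

    hospital_means = np.array([0.90, 0.35, 0.42, 0.38])  # hospital 1 is extreme
    hospital_n     = np.array([8, 40, 35, 50])           # patients per hospital
    sigma2, tau2 = 1.0, 0.02   # within- and between-hospital variances (assumed)

    grand_mean = np.average(hospital_means, weights=hospital_n)
    weight = tau2 / (tau2 + sigma2 / hospital_n)  # reliability of each raw mean
    shrunk = weight * hospital_means + (1 - weight) * grand_mean
    for raw, pooled in zip(hospital_means, shrunk):
        print(f"raw {raw:.2f} -> partially pooled {pooled:.2f}")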
Repeat experiments of previous studies that use P-values have poor replication rates (Ioannidis 2005). There are many reasons for poor replication, including claims of conclusive findings that are based solely on a single study and small sample sizes resulting in studies having low statistical power (Ioannidis 2005). This ‘replication crisis’ was one of the reasons the American Statistical Association issued a policy statement (Wasserstein and Lazar 2016) outlining some consensus principles to improve the interpretation and avoid the misuse of P-values.
Bayesian analysis is relevant to the replication crisis because it avoids the statistical shortcoming of treating a single study’s P-value as conclusive: it manages the uncertainty in parameters and incorporates prior information (Gelman 2015).
BPE could also help the replication crisis by focusing on the precision and accuracy of parameter estimates, which could be used as criteria for publication instead of the NHST goal of accepting or rejecting a null value (Kruschke and Liddell 2018a).
NHST, when applied and interpreted correctly, is useful if the researcher’s focus is on decision procedures based on long-run error frequencies – for example, ‘how often will I be wrong in the long term?’ Unfortunately, NHST has methodological problems even when applied correctly (Cohen 1994, Malone and Coyne 2020). Its error-based paradigm does not yield evidence supporting the null hypothesis and can overstate the evidence against the null hypothesis (Goodman 1999b, Wagenmakers 2007).
Bayesian analysis merits inclusion in the methods researchers use to analyse data. The two Bayesian approaches discussed here are superior to NHST, as they provide informative inferences and lack its many limitations (Kruschke 2011, Kruschke and Liddell 2018a). Bayesian methods give researchers the ability to analyse data using relative evidence rather than relying on long-term error rates and binary decision procedures. Furthermore, the incorporation of prior information in Bayesian computation helps mitigate against false claims of discovery being made because of rogue, unrepresentative samples.
NHST and Bayesian analyses can be applied to the same data, but they are not complementary, as they are distinctly different inferential tools (Burnham and Anderson 2014). Researchers should give Bayesian conclusions preference when NHST and Bayesian conclusions differ (Dienes and Mclatchie 2018).
Bayesian analysis adds value to evidence-based decision-making in nursing and midwifery. A transition from NHST to the Bayesian approach may be achieved more effectively by making the results from both approaches available to the informed reader.