Equivalence testing: reversed hypotheses, margins, and the need for controlling researcher allegiance

Falk Leichsenring; Allan Abbass; Mark Hilsenroth; Patrick Luyten; Sven Rabung; Christiane Steinert

doi:10.1017/S0033291718003380

Published online by Cambridge University Press: 21 November 2018

Falk Leichsenring*: Affiliation:
Department of Psychosomatics and Psychotherapy, Justus-Liebig-University Giessen, Ludwigstr. 76, 35392 Giessen, Germany
Allan Abbass: Affiliation:
Department of Psychiatry, Dalhousie University, Centre for Emotions and Health, Halifax, 8203 5909 Veterans Memorial Lane, Halifax, NS, Canada, B3H 2E2, Canada
Mark Hilsenroth: Affiliation:
Derner School of Psychology, Adelphi University, Hy Weinberg Center, 1 South Avenue, Garden City, NY 11530-0701, USA
Patrick Luyten: Affiliation:
Faculty of Psychology and Educational Sciences, University of Leuven, Klinische Psychologie (OE), Tiensestraat 102 – bus 3722, 3000 Leuven, Belgium; Research Department of Clinical, Educational and Health Psychology, University College London, Gower Street, London WC1E 6BT, UK
Sven Rabung: Affiliation:
Department of Psychology, Alpen-Adria-Universität Klagenfurt, Universitätsstr. 65-67, A-9020 Klagenfurt, Austria
Christiane Steinert: Affiliation:
Department of Psychosomatics and Psychotherapy, Justus-Liebig-University Giessen, Ludwigstr. 76, 35392 Giessen, Germany; Department of Psychology, MSB Medical School Berlin, Calandrellistr. 1-9, 12247 Berlin, Germany
*: Author for correspondence: Falk Leichsenring, E-mail: Falk.Leichsenring@psycho.med.uni-giessen.de

Type: Correspondence
Information: Psychological Medicine, Volume 49, Issue 5, April 2019, pp. 876–878

DOI: https://doi.org/10.1017/S0033291718003380
Copyright: Copyright © Cambridge University Press 2018

The number of non-inferiority or equivalence trials is increasing in many areas of research (Piaggio et al., 2012). Thus, the specific statistical and methodological issues involved in these types of trials are becoming more and more important. A recent correspondence article by Rief and Hofmann (2018a) suggests that several of these issues need some clarification.

In a reply to our comment on their article on non-inferiority testing (Leichsenring et al., 2018a; Rief and Hofmann, 2018b), Rief and Hofmann (2018a) reject our statement that they misinterpreted the results of the Steinert et al. meta-analysis (Steinert et al., 2017). They maintain that a significant disadvantage of psychodynamic therapy compared with other therapies was shown. However, statistical results cannot be detached from the scientific hypothesis tested, that is, from their context conditions. Steinert et al. tested the hypothesis of equivalence of psychodynamic therapy to other treatments with established efficacy, including CBT. In equivalence testing, the null and alternative hypotheses are reversed: a significant result indicates that the null hypothesis of non-equivalence (e.g. δ > 0.25) is rejected and equivalence can be concluded (Walker and Nowacki, 2011). Thus, the p values (e.g. p = 0.016) reported by Steinert et al. (2017) do not indicate a disadvantage of psychodynamic therapy, as erroneously stated by Rief and Hofmann (2018a, p. 2), but equivalence. Equivalence was convincingly demonstrated since all confidence intervals were completely included within the pre-defined margin (δ) of equivalence (g = ±0.25), one of the smallest margins ever proposed for demonstrating equivalence in psychotherapy research (Steinert et al., 2017).
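The reversed logic of equivalence testing can be illustrated with a minimal sketch of the two one-sided tests (TOST) procedure under a normal approximation. The function name `tost_equivalence`, the sign convention and the standard error of 0.04 are assumptions chosen for illustration only; they are not taken from Steinert et al. (2017), and the sketch does not reproduce their analysis.

```python
from statistics import NormalDist


def tost_equivalence(estimate, std_error, margin, alpha=0.05):
    """Two one-sided tests (TOST) for equivalence, normal approximation.

    H0: |true effect| >= margin  (non-equivalence)
    H1: |true effect| <  margin  (equivalence)
    """
    norm = NormalDist()
    # Test 1: is the effect larger than the lower bound (-margin)?
    p_lower = 1 - norm.cdf((estimate + margin) / std_error)
    # Test 2: is the effect smaller than the upper bound (+margin)?
    p_upper = norm.cdf((estimate - margin) / std_error)
    # Both one-sided tests must be significant, so report the larger p value.
    p_value = max(p_lower, p_upper)
    # Same decision, expressed as a (1 - 2*alpha) confidence interval
    # that has to lie completely inside (-margin, +margin).
    z = norm.inv_cdf(1 - alpha)
    ci = (estimate - z * std_error, estimate + z * std_error)
    return {"p_value": p_value, "ci": ci, "equivalent": p_value < alpha}


# Hypothetical inputs: g = -0.16 with an assumed standard error of 0.04,
# tested against the margin of +/-0.25 discussed in the text.
result = tost_equivalence(estimate=-0.16, std_error=0.04, margin=0.25)
print(result)
```

With these assumed inputs, a small p value means that non-equivalence is rejected; it does not mean that a disadvantage has been shown. The same estimate with a larger standard error would fail the test, because the confidence interval would then cross the margin.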

Rief and Hofmann do not suggest a realistic and scientifically justified margin for testing equivalence, neither in their first paper (Rief and Hofmann, 2018b) nor in their new article (Rief and Hofmann, 2018a). In their recent paper (Rief and Hofmann, 2018a, p. 2), they merely claim that a therapy that is ‘10% or 20% less effective than the current gold standard’ cannot be recommended as a treatment, neither ethically nor scientifically. This statement is questionable for several reasons, which are addressed below.

Rief and Hofmann (2018a) do not give any scientific justification for their proposal: the authors do not specify what they regard as compatible with equivalence, but state only implicitly what they regard as not compatible with equivalence, namely a difference of 10% or above. Why 10%, not 15% or 5%?

As an aside: Steinert et al. (2017) found a difference between psychodynamic therapy and CBT in terms of Hedges’ g of 0.16, which corresponds to a difference in success rates of <9% (Kraemer and Kupfer, 2006). Thus, it is below 10%, which is apparently regarded by Rief and Hofmann as compatible with equivalence.
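The conversion from a standardized mean difference to a difference in success rates can be sketched as follows. The sketch assumes normally distributed outcomes with equal variances, in which case the area under the curve is AUC = Φ(d/√2) and the success rate difference is SRD = 2·AUC − 1; the function name `success_rate_difference` is hypothetical. Under these assumptions the result for g = 0.16 is roughly 9% and below the 10% threshold; small deviations from the figure cited above can arise from rounding or from the exact conversion used.

```python
from math import sqrt
from statistics import NormalDist


def success_rate_difference(d):
    """Convert a standardized mean difference into a success rate difference.

    Assumes normal outcomes with equal variances in both groups, so that
    AUC = Phi(d / sqrt(2)) and SRD = 2 * AUC - 1.
    """
    auc = NormalDist().cdf(d / sqrt(2))
    return 2 * auc - 1


srd = success_rate_difference(0.16)
print(f"g = 0.16 corresponds to an SRD of about {srd:.3f}")
```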

Furthermore, what would a statistically significant difference, as suggested by Rief and Hofmann (2018a), imply? Cohen (1988) proposed to consider an effect size of 0.20 as small. Thus, 0.16 is clearly a small effect. In a meta-analysis, even very small effect sizes may become statistically significant owing to the high statistical power of pooled samples. Such small differences may not be of any clinical significance (Turner et al., 2010; McGlothlin and Lewis, 2014). Rief and Hofmann (2018a) suggest that CBT is superior to psychodynamic therapy by an effect size of 0.16. This corresponds to a difference of about 1 scale point on the Hamilton Depression Rating Scale, for instance, a difference that can hardly be considered clinically important. Not only statistical significance, but also clinical significance needs to be taken into account when defining margins, reviewing results or giving treatment recommendations.
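Both points, the dependence of statistical significance on sample size and the translation of an effect size into raw scale points, can be illustrated with a minimal sketch. It relies on the large-sample approximation SE(d) ≈ √(2/n) for two equal groups of size n. The sample sizes and the pooled standard deviation of 6.5 Hamilton points are hypothetical values chosen for illustration, not figures from the studies cited here.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Approximate two-sided p value for a standardized mean difference.

    Uses SE(d) ~ sqrt(2 / n) for two equal groups of size n, which
    ignores the small additional term that depends on d**2.
    """
    z = abs(d) / sqrt(2 / n_per_group)
    return 2 * (1 - NormalDist().cdf(z))


def raw_difference(d, pooled_sd):
    """Translate a standardized difference back into raw scale points."""
    return d * pooled_sd


# The same effect size of 0.16 at increasing (hypothetical) sample sizes.
for n in (50, 400, 1000):
    print(n, round(two_sided_p(0.16, n), 4))

# Hypothetical pooled SD of the Hamilton Depression Rating Scale.
ASSUMED_HAMD_SD = 6.5
print(raw_difference(0.16, ASSUMED_HAMD_SD))
```

Under these assumptions the effect of 0.16 is far from significant with 50 patients per group but clearly significant with 1000 per group, while its size in raw points stays at roughly 1 scale point throughout.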

Furthermore, it is not clear what the ‘gold standard’ mentioned by Rief and Hofmann (2018a, p. 2) is, as response rates of most treatments for common mental disorders hover around 50% with no significant differences between types of psychotherapy (Cuijpers et al., 2014). As we have argued before, there is much room for improvement of current treatments and we are far from having established a ‘gold standard’ treatment. Specifically, the evidence base for CBT is less solid than previously thought if response and remission rates, comparative efficacy, evidence on mechanisms of change and study quality are taken into account (Leichsenring and Steinert, 2017; Leichsenring et al., 2018b).

If the advantage of CBT is not more than about 1 scale point on the Hamilton Depression Rating Scale, for instance, or 0.16 in general, despite many years of research and millions of dollars, pounds, and euros used to fund research on CBT (MQ, 2015), this is not very impressive and again hardly qualifies as a gold standard. At follow-up, the difference between psychodynamic therapy and CBT was even smaller (g = −0.05, p = 0.0001, Steinert et al., 2017).

CBT research has received far more funding than research on other approaches: in the UK alone, CBT research was awarded 30.42 million pounds between 2008 and 2013 (MQ, 2015). In the same time interval, psychodynamic therapy was funded with only 1.53 million pounds (MQ, 2015), about one-twentieth of the amount awarded to CBT. Thus, ‘equivalence’ in funding does not exist. If the starting conditions of treatments are that different, fair comparisons are hardly possible, neither in equivalence nor in superiority trials. For this reason, differences in funding deserve more attention. As biases in funding decisions were found to be common, they need to be better controlled for (Nicholson and Ioannidis, 2012).

Furthermore, Rief and Hofmann (2018a) question the results of a recent equivalence meta-analysis (Steinert et al., 2017) by repeating their allegation (Rief and Hofmann, 2018a, b) that this meta-analysis was biased as it was supported by a German psychodynamic society. However, in order to control for researcher allegiance, Steinert et al. included a prominent CBT researcher (JH) in this meta-analysis of psychodynamic therapy, thus establishing a form of adversarial collaboration. As stated in the original article and in a first reply to Rief and Hofmann, the sponsor did not influence the procedures, the results or the presentation of the Steinert et al. meta-analysis in any way (Steinert et al., 2017; Leichsenring et al., 2018a). Thus, it is not clear why Rief and Hofmann keep repeating their false allegation. To date, Rief and Hofmann have not included a prominent psychodynamic researcher in any of their studies to balance any implicit or explicit researcher biases. We invite them to do so.

As research in psychodynamic therapy is rarely funded by public funding organizations, not only in the UK but also by the NIMH in the USA or the DFG in Germany, psychodynamic researchers are in need of other sponsors, which is then turned against them by CBT advocates, for example, by Rief and Hofmann (2018a): a vicious circle.

Finally, Rief and Hofmann (2018a, p. 2) claim that a therapy that is ‘10% or 20% less effective than the current gold standard’ cannot be recommended as a treatment, neither ethically nor scientifically. However, this claim would only be justified if there was another therapy with (close to) 100% efficacy. Response rates for the different forms of psychotherapy including CBT, which seems to be regarded as the gold standard by Rief and Hofmann (2018a), are about 50% and rates for remission are even lower (Cuijpers et al., 2014; Loerinc et al., 2015). Thus, many patients do not sufficiently benefit from one specific psychotherapeutic approach. They may, however, benefit from another approach, e.g. from interpersonal therapy, systemic therapy, or psychodynamic therapy. The chance of helping the other half of patients should outweigh a small difference in effect sizes such as 0.16. For this reason, recommending only one evidence-based approach, owing to a possible advantage of g = 0.16, and excluding others is not justified, neither ethically nor scientifically. Treatment recommendations need to take the whole complexity of evidence into account instead of focusing on one isolated statistical parameter only.

These considerations suggest the following conclusions for equivalence testing in particular and for research in general.

• Equivalence margins need to be explicitly formulated and scientifically justified, taking the conditions of the respective disorder into account.
• Not only statistical significance, but also clinical significance needs to be taken into account when defining margins, reviewing results, or giving treatment recommendations.
• Researcher allegiance needs to be taken more explicitly into account, in equivalence trials, but also in superiority trials and meta-analyses, as well as in treatment guideline committees, for example, by establishing forms of adversarial collaboration.
• Biases in funding decisions are common and need to be better controlled for, too.
• For fair comparisons between treatments in equivalence and superiority trials, the different treatments need to be more equally supported by funding organizations.
• For the sake of our patients, funding in psychotherapy needs to be distributed more broadly. Patients who do not benefit from one evidence-based approach may benefit from another. A diversity of equally funded and evidence-based treatments is a strength; a psychotherapy monoculture is not.
• Treatment recommendations need to be balanced, taking the whole complexity of evidence into account. Focusing on one isolated statistical parameter only is not justified, neither ethically nor scientifically.

Conflict of interest

None.

References

Cohen, J (1988) Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum.

Cuijpers, P, Karyotaki, E, Weitz, E, Andersson, G, Hollon, SD and van Straten, A (2014) The effects of psychotherapies for major depression in adults on remission, recovery and improvement: a meta-analysis. Journal of Affective Disorders 159, 118–126.

Kraemer, HC and Kupfer, DJ (2006) Size of treatment effects and their importance to clinical research and practice. Biological Psychiatry 59, 990–996.

Leichsenring, F and Steinert, C (2017) Is cognitive behavioral therapy the gold standard for psychotherapy? The need for plurality in treatment and research. JAMA 318, 1323–1324.

Leichsenring, F, Abbass, A, Driessen, E, Hilsenroth, M, Luyten, P, Rabung, S and Steinert, C (2018a) Equivalence and non-inferiority testing in psychotherapy research. Psychological Medicine 48, 1917–1919.

Leichsenring, F, Abbass, A, Hilsenroth, MJ, Luyten, P, Munder, T, Rabung, S and Steinert, C (2018b) ‘Gold standards’, plurality and monocultures: the need for diversity in psychotherapy. Frontiers in Psychiatry 9, 159.

Loerinc, AG, Meuret, AE, Twohig, MP, Rosenfield, D, Bluett, EJ and Craske, MG (2015) Response rates for CBT for anxiety disorders: need for standardized criteria. Clinical Psychology Review 42, 72–82.

McGlothlin, AE and Lewis, RJ (2014) Minimal clinically important difference: defining what really matters to patients. JAMA 312, 1342–1343.

MQ (2015) MQ Landscape Analysis. UK Mental Health Research Funding.

Nicholson, JM and Ioannidis, JP (2012) Research grants: conform and be funded. Nature 492, 34–36.

Piaggio, G, Elbourne, DR, Pocock, SJ, Evans, SJ and Altman, DG (2012) Reporting of noninferiority and equivalence randomized trials: extension of the CONSORT 2010 statement. JAMA 308, 2594–2604.

Rief, W and Hofmann, SG (2018a) The limitations of equivalence and non-inferiority trials. Psychological Medicine, 1–2. https://doi.org/10.1017/S003329171800289.

Rief, W and Hofmann, SG (2018b) Some problems with non-inferiority tests in psychotherapy research: psychodynamic therapies as an example. Psychological Medicine 48, 1392–1394.

Steinert, C, Munder, T, Rabung, S, Hoyer, J and Leichsenring, F (2017) Psychodynamic therapy: as efficacious as other empirically supported treatments? A meta-analysis testing equivalence of outcomes. American Journal of Psychiatry 174, 943–953.

Turner, D, Schunemann, HJ, Griffith, LE, Beaton, DE, Griffiths, AM, Critch, JN and Guyatt, GH (2010) The minimal detectable change cannot reliably replace the minimal important difference. Journal of Clinical Epidemiology 63, 28–36.

Walker, E and Nowacki, AS (2011) Understanding equivalence and noninferiority testing. Journal of General Internal Medicine 26, 192–196.
