Psychological treatments can provide patients with effective means of overcoming mental distress and increasing their well-being (McHugh and Barlow, 2010). Research on the efficacy of evidence-based approaches, such as cognitive behaviour therapy (CBT), suggests that a large number of patients improve thanks to the interventions they receive (Hofmann et al., 2012). However, not everyone seems to benefit, with only half of the patients being regarded as responders at post-treatment and follow-up (Loerinc et al., 2015). Similarly, several investigations suggest that a small proportion of all patients deteriorate during treatment (Boswell et al., 2015). For instance, Hansen et al. (2002) found that 8.2% fared worse in routine outpatient care, which can be compared with 6.9% within a psychiatric population (Mechler and Holmqvist, 2016), and 5.8% in clinical trials of internet-based CBT (Rozental et al., 2017).
Non-response and deterioration are, however, largely determined using different statistical procedures, cut-offs, or diagnostic criteria. Yet other negative effects might occur, but have thus far gained less attention (Castonguay et al., 2010). A major problem in relation to exploring such cases is the fact that no reliable instrument has existed, making it difficult to investigate their incidence and characteristics. Strupp and Hadley (1977) suggested that stigma, dependency and novel symptoms could occur and affect the patient negatively, but the means for their assessment were not discussed. A comprehensive rating system for videotaped sessions was later proposed, the Vanderbilt Negative Indicators Scale (VNIS; Suh et al., 1986), but it never gained widespread attention. Recently, attempts have instead been made to investigate negative events from the perspective of the patient, most notably the Inventory for the assessment of Negative Effects of Psychotherapy (INEP; Ladwig et al., 2014), and the Experiences of Therapy Questionnaire (ETQ; Parker et al., 2013). However, these instruments have not yet been used to a great extent for patients undergoing treatment. Furthermore, some of their items seem hard to apply in some settings (e.g. insurance issues), do not target negative effects explicitly, or are more related to malpractice than to unintended effects of evidence-based approaches.
Rozental et al. (2016) thus developed a novel instrument to assess negative effects in psychological treatments to overcome some of the previous shortcomings. Using the results from a study on negative effects in a clinical trial on social anxiety disorder (Boettcher et al., 2014), a consensus statement among researchers (Rozental et al., 2014), a qualitative analysis of the responses to open-ended questions among patients in treatment (Rozental et al., 2015), and a literature review, items for the Negative Effects Questionnaire (NEQ) were generated and investigated using an exploratory factor analysis. The resulting instrument consists of 32 items and six factors, explaining a total variance of 57.6%: symptoms, quality, dependency, stigma, hopelessness, and failure. Findings also indicate that symptoms, e.g. ‘I had more problems with my sleep’ (Item 1), accounted for 36.6% of the variance, possibly making it the most important factor in terms of negative effects in psychological treatments. Furthermore, one-third of the participants reported having experienced unpleasant memories, stress and anxiety, suggesting that these incidents could be fairly common in treatment.
However, an exploratory factor analysis assumes that items are scored on an interval or quasi-interval scale, in line with classical test theory (Wright, 1977). This presumes that all items are equally difficult for the respondent, or person, to complete, which might not be the case in reality. In addition, it does not allow a separation between persons and items, that is, to assess not only how well each item fits the underlying construct, but also the person’s response patterns. This can be helpful for identifying abnormal responses that might warrant further development of the instrument (Andrich, 1978). Rasch analysis, on the other hand, which is based on modern test theory, can be applied in order to analyse ordinal data in a way that provides linear measures, thereby addressing some of the caveats of classical test theory (Wright, 1996). This generates estimates of reliability and validity both for persons and items, making it feasible to study the instrument in greater depth (Waugh and Chapman, 2004). Furthermore, Rasch analysis can specifically be applied to test the dimensionality of an instrument, which, in the case of negative effects, can be assumed to be unidimensional, i.e. to form a single underlying construct. This seems plausible from a theoretical point of view, that is, negative effects should constitute one type of outcome related to psychological treatments, yet it has never been tested previously.
The purpose of the current study was therefore to use Rasch analysis to further examine the NEQ, based on data from patients having completed the instrument at post-treatment in five clinical trials (N = 564). This has not been applied before in relation to negative effects, something that could shed some light on the psychometric properties of an instrument that might become useful in clinical and research settings. In particular, such a method makes it possible to detect item bias and to explore whether each item performs in a comparable way across sociodemographics. A similar study was conducted using Rasch analysis for the Depression, Anxiety and Stress Scales (Lovibond and Lovibond, 1995), suggesting that a number of items could be removed and that it was not supported as a general instrument for mental distress (Shea et al., 2009).
The aim of the current study was thus twofold: to explore the response categories of the NEQ to see if they are of incremental scale steps, i.e. 0–4, and to examine the response pattern and goodness-of-fit between persons and items. The overall objective was to determine the usefulness of the NEQ as a way of exploring negative effects in psychological treatments.
Participants were recruited from five clinical trials of spider phobia, perfectionism, social anxiety disorder, and loneliness (N = 564). Each case involved self-referrals and the studies were advertised in Sweden via national and regional newspapers and radio shows, social media, posters and flyers. A complete overview of the sociodemographics and clinical variables at pre-treatment is given in Table 1. Because not every clinical trial requested the same type of information from the participants, there was some degree of systematic missing data, e.g. living with someone, prior psychological treatment, and prior or ongoing psychotropic medication. Also, due to publication issues, symptom severity could not be presented for one of the clinical trials. In addition, one of the clinical trials, i.e. social anxiety disorder (n = 189), was included as part of the exploratory factor analysis of the NEQ (Rozental et al., 2016).
Table 1 notes: a, category not applicable in n = 3; b, based on n = 561; c, based on n = 405; d, based on n = 518.
Treatment and therapists
The psychological treatments that were administered in the clinical trials consisted of CBT, delivered in various formats: face-to-face, virtual reality, and via the internet, with or without guidance from a therapist, or by support on demand (Andersson et al., 2017). The therapists were master's degree students having undergone basic clinical training or more experienced therapists in advanced clinical training (i.e. psychotherapists in training). As for the internet conditions, participants received weekly modules consisting of both reading material and exercises to be completed every week, comparable to a self-help book (Andersson, 2016). The psychological treatments ranged from one session to 9 weeks; shortest for spider phobia and longest for social anxiety disorder.
The participants filled out their sociodemographics and several outcome measures during the recruitment process before being assessed for eligibility. This was performed on a secure online interface using an auto-generated identification code, such as 1234abcd, thereby ensuring anonymity and minimizing data loss (Vlaescu et al., 2016). Upon completing their treatment, the participants answered the outcome measures again, with the addition of the NEQ (Rozental et al., 2016). The only exception was the clinical trial of spider phobia, where paper and pencil were used.
The Negative Effects Questionnaire. The NEQ was developed by Rozental et al. (2016) with the aim of investigating the occurrence and characteristics of negative effects in psychological treatments. The process of developing the instrument is described in detail in the original study. The exploratory factor analysis resulted in a rotated factor-solution with 32 items and the following six factors: symptoms, quality, dependency, stigma, hopelessness, and failure. The NEQ was found to have good internal consistency, with α = .95 for the full instrument and a range of .72 to .93 for the six separate factors. The instrument also includes one open-ended question in order to capture other negative effects that are not included among the items, but this was not explored in the current study.
Each clinical trial included in the current study distributed a primary outcome measure selected by relevance, for instance the Spider Phobia Questionnaire (Muris and Merckelbach, 1996). Several secondary outcome measures were also administered: the Patient Health Questionnaire – 9 items (PHQ-9; Löwe et al., 2004), the Generalized Anxiety Disorder – 7 items (GAD-7; Spitzer et al., 2006), and the Brunnsviken Brief Quality of Life Scale (BBQ; Lindner et al., 2016). These are, however, only presented descriptively in Table 1 for an overview of the sample.
In order to investigate and evaluate the validity of the internal structure and response processes of the NEQ, Rasch analysis was applied, following the same steps as described in Lerdal et al. (2016). The software WINSTEPS, version 22.214.171.124, was used for all analyses, implementing a rating scale model as all of the items in the NEQ are scored on a similar rating scale category. Rasch analysis converts the patterns of raw scores from the NEQ into item and person equal-interval measures simultaneously, using a logarithmic transformation of the odds probabilities of the responses (Bond and Fox, 2013). The converted item measures are then applied to determine whether the items are scored on a similar unidimensional construct, which is often viewed as crucial in terms of validity in both classical and modern test theory (Spector, 1992). In a similar manner, the converted person measures are utilized to evaluate person response validity and the precision of the scale.
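As an illustration of what a rating scale model implies, the sketch below computes category probabilities for one person and one item in the Andrich formulation, where the log-odds of choosing category k over k − 1 equal the person measure minus the item difficulty minus a step threshold shared by all items. It is a minimal sketch, not the WINSTEPS implementation; the function name and all numeric values are hypothetical.

```python
import math

def rsm_category_probabilities(theta, delta, thresholds):
    """Return P(X = k) for k = 0..len(thresholds) under a rating scale model.

    theta: person measure in logits; delta: item difficulty in logits;
    thresholds: step thresholds tau_1..tau_m, shared by all items.
    """
    # Cumulative sums of (theta - delta - tau_j); the k = 0 term is 0 by convention.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - delta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values: a five-category (0-4) item with four illustrative thresholds.
probs = rsm_category_probabilities(theta=0.0, delta=0.5,
                                   thresholds=[-1.5, -0.5, 0.5, 1.5])
```

The adjacent-category log-odds recovered from `probs` are linear in the person and item measures, which is the sense in which the transformation yields equal-interval (logit) measures.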
The psychometric properties of the NEQ rating scale categories were initially examined using the following criteria: (a) minimum of 10 responses per step category, (b) the average measures for each step category should advance monotonically, and (c) outfit Mean Square (MnSq) values less than 2.0 for the step category calibrations (Linacre, 2002). If these criteria were not initially met, actions to collapse rating scale categories or deletion of categories would be initiated, in line with the literature (Linacre, 2004).
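The three criteria can be expressed as a simple check over per-category summaries. The sketch below assumes the counts, average measures and outfit MnSq values have already been exported from the Rasch software; the function name and the example numbers are hypothetical and are not results from the study.

```python
def check_rating_scale_categories(counts, average_measures, outfit_mnsq,
                                  min_count=10, max_outfit=2.0):
    """Evaluate criteria (a)-(c); inputs are lists ordered by category (0, 1, 2, ...)."""
    # (a) at least min_count responses in every category
    enough_responses = all(c >= min_count for c in counts)
    # (b) average measures strictly increase from one category to the next
    monotonic = all(a < b for a, b in zip(average_measures, average_measures[1:]))
    # (c) outfit MnSq below the cut-off in every category
    acceptable_outfit = all(m < max_outfit for m in outfit_mnsq)
    return {
        "enough_responses": enough_responses,
        "monotonic": monotonic,
        "acceptable_outfit": acceptable_outfit,
        "all_met": enough_responses and monotonic and acceptable_outfit,
    }

# Hypothetical category summaries for a 0-4 rating scale.
result = check_rating_scale_categories(
    counts=[120, 85, 60, 34, 15],
    average_measures=[-1.8, -0.9, -0.2, 0.4, 1.1],
    outfit_mnsq=[0.95, 1.02, 0.89, 1.11, 1.05],
)
```

If `all_met` were false, the text's remedy (collapsing or deleting categories) would be applied and the check repeated.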
Evidence of internal structure of the NEQ was then further investigated by monitoring the item goodness-of-fit statistics. WINSTEPS generates both MnSq residuals and standardized z-values for each of the items of the NEQ. The goodness-of-fit statistics indicate the degree of match between actual responses on the items and expected responses from the Rasch model assertions (Bond and Fox, 2013). Goodness-of-fit was evaluated by infit statistics, as they are viewed as more sensitive to item performance and also more informative when exploring internal scale validity (Wright and Masters, 1982; Bond and Fox, 2013). Furthermore, the MnSq fit statistic is preferable for item goodness-of-fit with polytomous data as it is less sensitive to sample size (Smith et al., 2008). The current study chose a sample-size adjusted criterion for item goodness-of-fit set for infit MnSq values between 0.7 and 1.3 for the NEQ (Smith et al., 2008). If one or more items would not demonstrate acceptable goodness-of-fit to the model, the items would be removed from the analysis and the iteration process would be repeated until all items met the criterion of acceptable goodness-of-fit.
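The infit statistic and the iterative removal procedure can be sketched as follows. The infit MnSq is the sum of squared residuals divided by the sum of model variances, so that well-targeted observations carry more weight. The 0.7 to 1.3 range is treated as inclusive here, which is an assumption, and `estimate_infit` is a hypothetical callback standing in for a re-run of the Rasch software after each removal.

```python
def infit_mnsq(observed, expected, variances):
    """Information-weighted mean square: squared residuals over summed model variances."""
    squared_residuals = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return squared_residuals / sum(variances)

def misfitting_items(infit_by_item, lower=0.7, upper=1.3):
    """Return the items whose infit MnSq falls outside [lower, upper], sorted by name."""
    return sorted(item for item, value in infit_by_item.items()
                  if not lower <= value <= upper)

def prune_until_fit(items, estimate_infit, lower=0.7, upper=1.3):
    """Drop misfitting items and re-estimate until every remaining item fits.

    estimate_infit is a hypothetical stand-in for the Rasch software: it takes
    a list of items and returns a dict {item: infit MnSq} for that item set.
    """
    remaining = list(items)
    while remaining:
        flagged = set(misfitting_items(estimate_infit(remaining), lower, upper))
        if not flagged:
            break
        remaining = [item for item in remaining if item not in flagged]
    return remaining
```

Because fit statistics depend on which items are in the scale, the estimates must be recomputed after every removal, which is why the procedure runs in iterations rather than a single pass.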
In order to evaluate the unidimensionality of the NEQ, a principal component analysis of the residuals was also performed (Linacre, 2005). The criterion for unidimensionality was that at least 50% of the total variance should be explained by the first latent variable (Raîche, 2005), and that no more than 5% should be explained by the largest secondary dimension, with an associated eigenvalue of no more than 2.0; meeting these thresholds indicates a lack of multi-dimensionality.
Evidence of person response validity was then evaluated by monitoring the person goodness-of-fit statistics. The criterion for evaluating person goodness-of-fit was to reject infit MnSq values >1.4 associated with a z-value >2. It was also accepted that 5% of the sample may fail to demonstrate acceptable goodness-of-fit by chance, without a serious threat to validity (Patomella et al., 2006).
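A minimal sketch of this person-fit rule is given below, assuming each respondent's infit MnSq and z-value are available as a pair. The function name and the sample data are hypothetical.

```python
def person_fit_summary(person_stats, max_infit=1.4, max_z=2.0, tolerated_share=0.05):
    """person_stats: list of (infit_mnsq, z_value) pairs, one per respondent."""
    # A respondent misfits only when BOTH thresholds are exceeded.
    misfits = [(m, z) for m, z in person_stats if m > max_infit and z > max_z]
    share = len(misfits) / len(person_stats)
    return {
        "n_misfit": len(misfits),
        "share": share,
        "acceptable": share <= tolerated_share,
    }

# Hypothetical sample: 40 respondents, one of whom exceeds both thresholds.
sample = [(1.0, 0.1)] * 38 + [(1.6, 1.2), (1.8, 2.5)]
summary = person_fit_summary(sample)
```

Requiring both conditions means a high MnSq alone, as in the second-to-last respondent of the example, does not count as misfit unless it is also statistically notable.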
In order to monitor the precision of the converted measures, the person and item separation indices were calculated (Fisher, 1992). The person separation index reflects the number of statistically different strata that the test can identify in the sample, considering the range and precision of the individual person and item estimates. In a similar way, the item separation index reflects the number of statistically different strata that the sample can identify among the items. An index above 1.5 would ensure that the NEQ could differentiate at least two different groups in the sample/among the items.
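The link between a separation index and the number of distinguishable groups can be illustrated with two conversions that are standard in the Rasch measurement literature but are not stated in this article, so they are included here as an assumption: strata H = (4G + 1) / 3 and reliability R = G² / (1 + G²), where G is the separation index.

```python
def strata_from_separation(separation):
    """Number of statistically distinct strata, H = (4G + 1) / 3."""
    return (4 * separation + 1) / 3

def reliability_from_separation(separation):
    """Separation reliability, R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)

# The cut-off used in the text: a separation index of 1.5.
threshold_strata = strata_from_separation(1.5)
threshold_reliability = reliability_from_separation(1.5)
```

Under these conversions a separation of 1.5 corresponds to a little more than two strata, which is consistent with the statement that an index above 1.5 allows at least two groups to be differentiated.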
Finally, a number of Differential Item Functioning (DIF) analyses were performed in order to explore the stability of the response patterns of the NEQ items across sociodemographics, giving further support of validity in relation to internal structure and potential unfairness in testing. This was conducted because it is crucial that an instrument is not biased with regard to any sociodemographics that may otherwise compromise the converted measures, question the validity of the instrument, and influence the interpretation of subsequent findings. The magnitude of DIF was evaluated using the Mantel-Haenszel statistic for polytomous scales using log-odds estimators (Mantel, 1963).
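To give a feel for a log-odds DIF estimate, the sketch below computes the Mantel-Haenszel common log-odds ratio for the simpler dichotomous case, in which respondents are stratified by overall measure and each stratum contributes a 2x2 table of group by endorsement. This is a simplification and not the polytomous Mantel procedure used in the study; the function name and the counts are hypothetical.

```python
import math

def mantel_haenszel_log_odds(strata):
    """Common log-odds ratio across strata for a dichotomous item.

    strata: list of (a, b, c, d) counts per stratum, where
    a = reference group endorsed, b = reference group not endorsed,
    c = focal group endorsed,     d = focal group not endorsed.
    A value near 0 suggests the item functions alike in both groups.
    Raises ZeroDivisionError or ValueError if either pooled term is zero.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    return math.log(numerator / denominator)

# Hypothetical counts: identical odds of endorsement in both groups per stratum.
no_dif = mantel_haenszel_log_odds([(10, 10, 5, 5), (20, 10, 10, 5)])
```

Stratifying by overall measure is what separates DIF from a genuine group difference: only respondents at a comparable level of the construct are compared on the item.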
Overall response pattern
When evaluating the rating scale categories of the NEQ, all of the criteria were met. All rating scale categories were used and advanced monotonically, and the outfit MnSq values for the step category calibrations ranged from 0.89 to 1.11. Only 281 participants out of 564 scored any of the items of the NEQ, and a total of 86% of the person-item data matrix were non-responses, i.e. empty cells (see Table 1). The following item and person validity analyses were thus performed with a limited number of data records, as only 50.9% of the sample reported having experienced any negative effect of their psychological treatment.
The first iteration generating item goodness-of-fit statistics for the 32 items revealed that six items did not meet the criterion for item goodness-of-fit (see Table 2). By removing these items, the next iteration revealed that an additional four items did not meet the criterion and were thus removed. In the third iteration, two more items were removed. Hence, after the third iteration and the removal of 12 items in total (37.5%), the remaining 20 items on the NEQ demonstrated acceptable item fit to the Rasch model assertions. For an overview of the frequencies and average negative impact of each item in the final scale, see Table 3.
a Based on the number of patients reporting any type of negative effect caused by treatment, N = 281.
Principal component analysis
Following the removal of the twelve items demonstrating misfit, the principal component analysis revealed that the first component explained 62.5% of the total variance, which exceeded the criterion of at least 50% required in order to establish unidimensionality (see Table 2). The second dimension explained an additional 6.3%, associated with an eigenvalue of 3.37, which exceeds the criteria set. By monitoring the item residual loadings, items 15, 11, 3 and 1 loaded more strongly on one component, while items 18, 4, 16, 12 and 20 loaded more strongly on another (see Table 4).
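The comparison of these values against the unidimensionality criteria can be written out as a short check. The sketch below reads the secondary-dimension thresholds (5% and an eigenvalue of 2.0) as upper bounds, which is an interpretation of the criteria rather than something the article states in those words; the function name is hypothetical.

```python
def check_unidimensionality(first_pct, secondary_pct, secondary_eigenvalue,
                            min_first_pct=50.0, max_secondary_pct=5.0,
                            max_eigenvalue=2.0):
    """Compare residual-PCA results with the unidimensionality thresholds."""
    first_dimension_ok = first_pct >= min_first_pct
    # Assumption: both secondary-dimension thresholds act as upper bounds.
    secondary_ok = (secondary_pct <= max_secondary_pct
                    and secondary_eigenvalue <= max_eigenvalue)
    return {
        "first_dimension_ok": first_dimension_ok,
        "secondary_ok": secondary_ok,
        "unidimensional": first_dimension_ok and secondary_ok,
    }

# Values reported above: 62.5% for the first component, 6.3% and an
# eigenvalue of 3.37 for the largest secondary dimension.
pca_check = check_unidimensionality(62.5, 6.3, 3.37)
```

With the reported values the first criterion is satisfied while the secondary-dimension criteria are not, which matches the mixed picture described in the paragraph.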
Person response validity
When evaluating the person response validity, twelve of the 264 participants (4.6%) did not demonstrate acceptable goodness-of-fit to the Rasch model in their responses to the NEQ, which met the criterion of up to 5%. The numbers of participants providing maximum and minimum scores are reported in Table 2.
The person-separation index for the original version of the NEQ, i.e. with 32 items, was 0.89. Moreover, the item-separation index (N = 281) was 2.01. After deletion of the 12 NEQ items demonstrating misfit to the Rasch model, the person-separation index increased to 1.08, and the item-separation index (N = 264) to 2.61 (see Table 2).
The DIF analyses revealed that all of the 20 remaining items of the NEQ functioned in a similar manner across sociodemographics (see Table 2), supporting fairness in testing.
The person-item map is presented in Fig. 1. Items reflecting negative effects more frequently experienced by the sample are placed at the lower end of the continuum, and items reflecting negative effects less frequently experienced by the sample are placed at the higher end of the continuum. In a similar way, participants with fewer experiences of negative effects are placed at the lower end of the continuum, and participants with more experiences of negative effects are placed at the higher end of the continuum.
The current study is the first to examine the psychometric properties of an instrument for determining negative effects of psychological treatments using Rasch analysis. In contrast to prior investigations, which have relied on classical test theory (Ladwig et al., 2014; Parker et al., 2013; Rozental et al., 2016), this has enabled an additional investigation of the reliability and validity of persons and items (Waugh and Chapman, 2004), providing a more comprehensive understanding of how negative effects might be assessed. The results suggest that the NEQ exhibits fairness in testing, i.e. it does not demonstrate any bias in terms of the participants’ sociodemographics. This important finding suggests that the instrument should yield comparable measures across respondents regardless of gender, age, civil status, educational level, and type of employment, as items are functioning in a similar manner. Also, out of the original 32 items of the NEQ, 12 could be removed as they did not meet the criterion for goodness-of-fit, resulting in a final scale of 20 items that can be downloaded and used for free in clinical and research settings: www.neqscale.com. Reviewing these items indicates that the factor failure is no longer included in the instrument, which may be explained by the fact that it accounted for less than 3% of the variance in Rozental et al. (2016). From a theoretical perspective, it is also uncertain if failure reflects a poor outcome rather than actually experiencing these negative effects during treatment, making it reasonable to exclude the items belonging to this factor from the NEQ.
As for the rest of the items that were removed from the instrument, these were primarily related to dependency and quality, and to a lesser extent hopelessness and symptoms. Albeit not as clear, it could be argued that these items are unrelated to the underlying construct of negative effects or that there is a considerable overlap between them and the items that were retained. For instance, ‘I did not have confidence in my treatment’ (Item 24) may possibly capture the same concept as ‘I felt that my expectations for the treatment were not fulfilled’ (Item 27), the latter being excluded. However, it is also important to note that although 12 items did not demonstrate acceptable fit to the Rasch measurement model, indicating that their scores varied too unexpectedly to contribute to one underlying measurable construct, they may still add important information about negative effects. Still, it seems that a final scale of 20 items is reliable and could be easier to administer compared with the total scale of 32 items, which should help researchers and therapists to monitor negative effects on a more regular basis.
The rating scale of the instrument also seems to function equally across items, i.e. advancing monotonically, suggesting that the incremental steps of 0 to 4 are appropriate. Several item residuals did, however, load on two components, implying possible multi-dimensionality in the instrument. In relation to the factors obtained by Rozental et al. (2016), the first component is associated with symptoms, while the other is linked to four separate factors. The reason for this finding is unclear and prior research has not discussed the dimensionality or hierarchy of negative effects. Nonetheless, one plausible explanation could be that it reflects a distinction between the subjective experiences of incidents occurring during treatment, e.g. more anxiety, and implications that are interpersonal or social in character, such as dependency and stigma. Strupp and Hadley (1977) considered this issue in their tripartite perspective of psychological treatments, proposing that positive and negative effects might be judged differently by the patient, the therapist, and significant others. Another explanation may be that some negative effects are short term, as in experiencing more unpleasant feelings during treatment, while others are long term, as in believing things cannot improve. This notion has been raised by Castonguay et al. (2010), pointing to the fact that some interventions will never be perceived as particularly pleasant to the patient, even though they are seen as beneficial in the long run. Differentiating those negative effects that are enduring from those that are transient is thus an important research endeavour, preferably by assessing such instances both during treatment and at long-term follow-up. Future studies should also explore if other approaches to examine the instrument (e.g. multi-dimensional Rasch modelling) could yield additional and better solutions to measure negative effects. Still, the findings from the current study indicate that a majority of the items function well enough together to explain a large proportion of the variance and that they also yield acceptable person-fit statistics, which is an important aspect of measurement validity. Given that research on negative effects of psychological treatments is still a fairly new and unexplored territory, psychometric issues such as multi-dimensionality are nevertheless important to consider in order to move the field forward.
As for the rate of negative effects, the number of participants reporting negative effects in the current study was 50.9%, consistent with 58.7% among patients in a psychiatric setting who responded to the INEP (Rheker et al., 2017). However, this number varies significantly between investigations, with rates as high as 92.9% among patients with obsessive-compulsive disorder who were assessed with the Side-effects of Psychotherapy Scale in a study by Moritz et al. (2015), and as low as 5.2% in a national survey by Crawford et al. (2016) probing for ‘lasting bad effects from the treatment’. Hence, different types of assessments and patients will generate different rates, making it difficult to determine which estimate is more accurate and to compare it across investigations. One of the advantages of implementing Rasch analysis is, however, the possibility to go beyond just frequencies or levels of symptoms, adjusting for both aspects within a sample. In other words, a person experiencing a large impact on a limited number of items that are rarely perceived among the sample will generate a higher measure of negative effects, compared with a person who is experiencing a moderate impact on a larger number of items that are more often experienced among the sample. Taking this into account, the results from the current study suggest that the instrument has an acceptable person goodness-of-fit, but that it is inappropriate for differentiating distinct subgroups with regard to their experiences of negative effects. This is caused by a relatively large individual standard error associated with each individual measure, as most participants only endorsed a limited number of items (see Table 4).
The NEQ is therefore restricted in detecting changes or differences within a specific sample based on their person measures, but is probably suitable for examining differences between clinical trials, settings or interventions by monitoring item difficulty calibrations, e.g. the rate and impact of a particular item, such as placement along the continuum. Further research will help to provide a better estimate of negative effects in psychological treatments by administering the NEQ on a more regular basis during the treatment period, but also to include it in more diverse patient populations.
There are some limitations that need to be addressed when interpreting the results. First, even though the sample was relatively heterogeneous with regard to their sociodemographics, the inclusion and exclusion criteria of each clinical trial may have affected the generalizability of the findings. A majority of the participants were female, middle-aged, in a relationship, having a university degree, and either students or employed, which may have affected the negative effects that were reported. Second, in terms of symptom severity at pre-treatment, the participants were, on average, sub-clinical, at least with regard to the PHQ-9, the GAD-7 and the BBQ, suggesting that they were relatively high functioning. It is possible that another sample would have responded differently on the NEQ, for instance patients with more severe psychiatric disorders than those included in the current study, such as personality disorders, recurrent depression, or eating disorders. Third, given that the participants received psychological treatments mostly administered via the internet or virtual reality, it might be that other formats or theoretical orientations than CBT could result in different negative effects, hence affecting such issues as what items to retain and the principal component analysis. Distributing the NEQ to even more diverse patient populations is therefore needed in order to fully understand the occurrence and characteristics of negative effects of psychological treatments. Until further research has been conducted, some caution is warranted in terms of interpreting the results from the 20-item version of the NEQ in other formats than the internet, as well as for patients with more severe psychiatric disorders.
Fourth, distributing an instrument to patients on a single occasion is problematic, especially concerning incidents that may have been experienced as negative by patients (Rozental et al., 2014). It is possible that the negative effects that were reported were affected by recall bias, primacy-recency effects, and social desirability (Krosnick, 1999), resulting in less valid responses. Future research could therefore include the NEQ on at least one more occasion during the treatment period, for instance at mid-assessment. This should also be accompanied by an investigation of its relationship with outcome, i.e. whether or not such incidents affect the long-term benefits.
The authors would like to thank the Swedish Association of Behaviour Therapy (SABT) for a travel grant that allowed A.R. to visit A.K. at the University of Illinois at Chicago to perform the data analysis.
This study was made possible in part by a travel grant by the Swedish Association of Behaviour Therapy (SABT). However, the funder had no role in the analyses or drafting of the manuscript.
Conflicts of interest
The authors have no conflicting interests to report.
The current study adheres to the ethical principles of psychologists and the code of conduct of the American Psychological Association. Ethical approval for the clinical trials was granted at each study location via their respective Regional Ethical Boards, and informed consent was collected from all participants.