Introduction
Common mental disorders (CMDs) such as depression, anxiety, and posttraumatic stress disorder (PTSD) affect millions of people globally. A meta-analysis across 39 countries indicated a lifetime prevalence of 29.2%, although this estimate varies across subgroups (Demyttenaere et al., Reference Demyttenaere, Bruffaerts, Posada-Villa, Gasquet, Kovess, Lepine and Al2004; Steel et al., Reference Steel, Marnane, Iranpour, Chey, Jackson, Patel and Silove2014). Particularly high prevalence rates have been estimated for specific populations, such as refugees and asylum seekers (Steel et al., Reference Steel, Chey, Silove, Marnane, Bryant and Van Ommeren2009; Charlson et al., Reference Charlson, van Ommeren, Flaxman, Cornett, Whiteford and Saxena2019). Some disorders may be more prevalent because of specific circumstances or group characteristics, however these differences could also reflect the performance of questionnaires across cultures (Gureje and Stein, Reference Gureje and Stein2012).
There is a large variety of brief, self-report screening instruments for symptoms of CMDs, such as the Hopkins Symptoms Checklist (HSCL), the Hospital Anxiety and Depression Scale (HADS), and the PTSD Checklist (PCL). Brief instruments can be useful for routine screening in primary and stepped care (Kagee et al., Reference Kagee, Tsai, Lund and Tomlinson2013; Olin et al., Reference Olin, Mccord, Kerker, Weiss, Hoagwood and Horwitz2017), especially where the application of time-consuming, clinician-administered structured interviews is not feasible, such as in low-resource settings (Kohrt et al., Reference Kohrt, Jordans, Tol, Luitel, Maharjan and Upadhaya2011). Furthermore, the ease of administration of most self-report measures makes them attractive for use in research (Kagee et al., Reference Kagee, Tsai, Lund and Tomlinson2013). However, these instruments are usually developed and evaluated in specific (Western, Anglo-Saxon) settings (Saxena et al., Reference Saxena, Paraje, Sharan, Karam and Sadana2006; Ali et al., Reference Ali, Ryan and DeSilva2016), while psychometric properties may vary across settings, cultures, and languages. For example, in a study on the validity of the HSCL-25 in Lebanon, the optimal cut-off score for anxiety and depression was found to be higher (2.00–2.10) than the widely accepted threshold of 1.75 (Mahfoud et al., Reference Mahfoud, Kobeissi, Peters, Araya, Ghantous and Khoury2013). This example illustrates the importance of cross-cultural validation of screening tools. The use of thresholds determined in other populations may lead to misclassification and misinterpretation (Steel et al., Reference Steel, Chey, Silove, Marnane, Bryant and Van Ommeren2009). However, literature on the psychometric properties of screening instruments in cultural contexts outside those for which they were developed is limited (Mutumba et al., Reference Mutumba, Tomlinson and Tsai2014; Carroll et al., Reference Carroll, Hook, Perez, Denckla, Vince, Ghebrehiwet, Ando, Touma, Borba, Fricchione and Henderson2020; Donnelly and Leavey, Reference Donnelly and Leavey2021).
The ability of a questionnaire (‘index test’) to identify individuals with a CMD compared to individuals without a disorder is called diagnostic accuracy (Leeflang et al., Reference Leeflang, Deeks, Takwoingi and Macaskill2013). Diagnostic accuracy is determined by comparing the outcomes of the index test with the outcomes of a reference standard in the same research subjects. The reference standard is regarded as the best available method to establish the presence or absence of the target condition (Rutjes, Reference Rutjes2017). A (semi-structured) clinical interview is the standard for diagnosing mental disorders in clinical practice and mental health research (De Joode et al., Reference De Joode, Van Dijk, Walburg, Bosmans, Van Marwijk, de Boer, Van Tulder and Adriaanse2019).
Previous systematic reviews on the validity of screening instruments have focused on a specific instrument (e.g. Edinburgh Postnatal Depression Scale; EPDS) (Gibson et al., Reference Gibson, McKenzie-Mcharg, Shakespeare, Price and Gray2009), outcome (e.g. depression) (Chorwe-Sungani and Chipps, Reference Chorwe-Sungani and Chipps2017), or income group (e.g. low- and middle-income countries; LAMIC) (Ali et al., Reference Ali, Ryan and DeSilva2016), but to our knowledge, no systematic review on test performance of brief screening instruments for CMDs in Arabic-speaking populations has been published. Despite the fact that Arabic is one of the most spoken languages in the world, with over 30 dialects and 274 million people that speak Arabic, research on Arabic-language questionnaires is limited (Easton et al., Reference Easton, Safadi, Wang and Hasson2017; Karnouk et al., Reference Karnouk, Boge, Lindheimer, Churbaji, Abdelmagid, Mohamad, Hahn and Bajbouj2021). Furthermore, last decades have known a steep increase in the number of Arabic-speaking refugees into other parts of the world, such as the Horn of Africa and Europe (UNHCR, 2019, 2021). Psychometrically sound and brief case-finding instruments are vital to scale-up mental health services for an adequate response to the mental health needs of Arabic-speaking refugees worldwide (Jefee-Bahloul et al., Reference Jefee-Bahloul, Bajbouj, Alabdullah, Hassan and Barkil-Oteo2016).
In this systematic review and meta-analysis, we provide an overview of the diagnostic accuracy of Arabic-language psychological distress screening instruments, based on all available evidence in Arabic-speaking adult populations.
Methods
This review was pre-registered in the International Prospective Register of Systematic Reviews (PROSPERO ID: CRD42018070645). We followed the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA-DTA) checklist (McInnes et al., Reference McInnes, Moher, Thombs, McGrath and Bossuyt2018); see online Supplementary Appendix 1.
Search strategy
We systematically searched EBSCO/APA PsycINFO, PubMed, Embase.com, Cochrane Library, and Scopus from inception until 22 January 2021, without language restrictions. The search was carried out by a medical information specialist. The following terms were used (including synonyms and closely related words) as index terms or free-text words: ‘Sensitivity and Specificity’, ‘Reference Standards’, ‘Diagnostic Self Evaluation’, ‘Common Mental Disorders’, and ‘Arabic speaking populations’. The full search strategy is attached as online Supplementary Appendix 2. We restricted the search to articles, proceeding papers, conference papers, and electronic collections. We also identified studies by screening literature lists of included studies (Prinsen et al., Reference Prinsen, Mokkink, Bouter, Alonso, Patrick, de Vet and Terwee2018).
Inclusion criteria
The full search yield was reviewed for inclusion by two independent reviewers (AdG/JU) on the basis of title and abstract. Both reviewers assessed full-texts of the remaining articles. Discrepancies were resolved by discussion, and remaining queries were discussed with a third reviewer (MS). The following inclusion criteria had to be met: Population – Arabic-speaking adults with no restrictions on setting. Index test – brief self-report questionnaires in Arabic on psychological distress, with no restrictions in terms of administration mode or administrator. We defined ‘brevity’ as 25 items or less, based on commonly used screening instruments (e.g. HSCL-25). We did not base our definition on, e.g. ‘time of administration’, given that time to complete a measure might vary among groups and literacy levels. Reference standard – a diagnosis made through a structured clinical interview or by a clinician based on the criteria of the Diagnostic and Statistical Manual of Mental Disorders (DSM) (American Psychiatric Association, 2013) or International Statistical Classification of Diseases and Related Health Problems (ICD) (WHO, 2019). Outcome – any CMD. CMDs refer to DSM/ICD diagnoses of anxiety, depressive (excluding bipolar), and stress-related disorders. Anxiety disorders include generalized anxiety disorder (GAD), panic disorder, phobia, agoraphobia, or social anxiety disorder. PTSD and acute stress disorder are included (as anxiety disorders in DSM-IV or as trauma- and stress-related disorders in DSM-5). We excluded papers in which the diagnosis was based on a questionnaire, observation checklist, chart review, or self-reported diagnosis. We also excluded studies that did not provide data to calculate sensitivity/specificity.
Data extraction
Data were extracted independently from each study by two reviewers (AdG/IS) using a coding scheme (The Cochrane Collaboration, 2020). Extracted data included study design (design and study dates), participant characteristics (eligibility criteria, setting, sample size, age, gender, nationality, and comorbidities), index test characteristics (description, time points, mode of administration, setting, translation, scale properties, and psychometric properties), reference test characteristics (description, time points, mode of administration, blinding, setting, translation, prevalence, and psychometric properties), and relevant outcomes measured (target condition, thresholds with corresponding diagnostic accuracy properties, i.e. sensitivity, specificity, area under the receiver-operating characteristic (ROC) curve (AUC), PPV and NPV, and data to generate 2 × 2 tables). Discrepancies were resolved by discussion.
Quality assessment
Risk of bias was independently assessed by two reviewers (AdG/IS) using the quality assessment tool of diagnostic accuracy studies (QUADAS-2) (Whiting et al., Reference Whiting, Rutjes, Westwood, Mallett, Deeks, Reitsma, Leeflang, Sterne and Bossuyt2011). QUADAS-2 is a generic set of criteria consisting of four key domains: patient selection, index test, reference standard, and flow of patients through the study and timing of the index test and reference standard. Signaling questions are included to judge risk of bias across all domains (Whiting et al., Reference Whiting, Rutjes, Westwood, Mallett, Deeks, Reitsma, Leeflang, Sterne and Bossuyt2011). We added three items to account for biases specific to the use of (semi-)structured clinical interviews. These extra items concerned (1) whether studies used a semi-structured interview v. clinician diagnosis (domain 3), (2) whether data on interviewer variation (e.g. inter-rater reliability) for the (semi)structured interview fell within an acceptable range (domain 3), and (3) whether all participants received a reference standard (domain 4). See online Supplementary Appendix 3 for item specifications.
Data synthesis and statistical analysis
We provided a narrative synthesis structured around the type of index test (i.e. questionnaire) and type of outcome. For every study, we tabulated the questionnaire, reported cut-off scores and outcome measures. In this review, we present cut-off scores as rounded numbers (e.g. ‘5’), whereby individuals are considered positive cases if they have that score at minimum (e.g. 5 or above). Meta-analysis was performed when at least three studies with a comparable outcome for a specific questionnaire were included. Multiple thresholds were modelled for studies reporting a range of cut-off scores (Steinhauser et al., Reference Steinhauser, Schumacher and Rücker2016) using the diagmeta package (Rucker et al., Reference Rucker, Steinhauser, Kolampally and Schwarzer2020) in R v3.6.1 (R Core Team, Reference R Core Team2019). This approach incorporates the following issues relevant for diagnostic reviews: (1) imprecision by which the sensitivity or specificity has been measured within each study, (2) variation beyond chance in the sensitivity and specificity between studies, and (3) correlation that might exist between sensitivity and specificity. It also estimates the sensitivity and specificity for a range of cut-off scores and determines the optimal threshold, based on the cut-off with the highest combination of sensitivity and specificity using the Youden index. We plotted the estimates of sensitivity and specificity for each reported cut-off and the optimal threshold of all studies in the meta-analysis in ROC space.
Results
Study inclusion and characteristics of included studies
The search yielded 3246 unique references (Fig. 1). Of these, 704 were identified as potentially relevant based on title/abstract screening. The full-text articles were obtained and assessed for inclusion. Thirty-two studies reporting on 30 unique datasets met the inclusion criteria. Of those, 17 studies were eligible for meta-analysis.
Seventeen different questionnaires on depression, anxiety, PTSD, and general distress were identified (Table 1). The number of items ranged from 5 to 25. Online Supplementary Appendix 4 provides a brief description of each questionnaire.
a A lower score indicates less wellbeing, and participants with a cut-off score of 12 or below are considered case positives; AUC = area under the Receiver Operating Curve; CI = Confidence Interval; PPV = positive predictive value; NPV = negative predictive value; EPDS = Edinburgh Postnatal Depression Scale; PHQ-9 = Patient Health Questionnaire; GDS-15 = Geriatric Depression Scale; BDI-II = Beck Depression Inventory; CES-D = Center for Epidemiological Studies Depression Scale; MDI = Major Depression Inventory; AES = Apathy Evaluation Scale; WHO-5 = WHO Well-being Index; PSST = Premenstrual Symptoms Screening Tool; GAD-7 = Generalized Anxiety Disorder-7; HADS = Hamilton Anxiety and Depression Scale; HSCL-25 = Hopkins Symptoms Checklist; PCAD = Primary Care Anxiety and Depression; SPTSS = Screen for Posttraumatic Stress Symptoms; SRQ-20 = Self-Reporting Questionnaire; GHQ-12 = General Health Questionnaire
One study was conducted among a sub-sample of Arabic-speaking migrants in Australia (Barnett et al., Reference Barnett, Matthey and Gyaneshwar1999), while all other studies (n = 31) were conducted in Arab countries. Participants (N = 4042; range 26–407), with mean age range 28–82 years, were selected from clinical settings (n = 21, 65.6%), community settings (n = 5, 15.6%), or both (n = 4, 12.5%). Nine (28.1%) studies included only women, two (6.3%) only men, 17 (53.1%) mixed samples, and two (6.9%) did not report gender (6.9%). None of the questionnaires were locally developed, but all were translations of English-language instruments: in the majority of studies, questionnaires (n = 20, 62.5%) were locally translated, five (15.6%) used/adapted already existing translations, and seven (21.9%) did not report on translation. In 20 studies (62.5%), questionnaires were administered by interviewers.
Twenty-two studies (68.7%) used a (semi-)structured clinical interview as reference standard. Seven studies used the Mini International Neuropsychiatry Inventory (MINI), five the Structured Clinical Interview for DSM (SCID), four the Composite International Diagnostic Interview (CIDI), three the Clinical Interview Schedule (CIS), two the Present State Examination (PSE), and one the Diagnostic Interview Schedule (DIS). These (semi-)structured interviews were conducted by a clinician (n = 13) or lay-interviewer (n = 3); six studies did not report on the type of interviewer. In the other 10 studies (31.3%), a clinician diagnosis according to the DSM/ICD was made. El-Hachem et al. (Reference El-Hachem, Rohayem, Bou, Richa, Kesrouani, Gemayel, Aouad, Hatab, Zaccak, Yaghi, Salameh and Attieh2014) combined the clinical interview with (readministration of) the index test.
Results of the systematic review
Nine depression-specific questionnaires were compared to a depression diagnosis (Table 1). The sensitivity in seven studies evaluating the EPDS ranged from 73% to 92%; its specificity ranged from 48% to 96%. The nine-item Patient Health Questionnaire (PHQ-9) was evaluated in four studies, with sensitivity ranging from 62% to 88%, and specificity from 46% to 96%. Three studies evaluated the Geriatric Depression Scale (GDS-15), with sensitivity ranging from 80% to 84%, and specificity from 87% to 91%. The other depression-specific instruments were evaluated by single studies. The Beck Depression Inventory-II (BDI-II) had a sensitivity of 96% and a specificity of 73%, the Center for Epidemiologic Studies Depression Scale (CES-D) had a sensitivity of 82% and a specificity of 83%, the Major Depression Inventory (MDI) had a sensitivity of 88% and a specificity of 79%, the Apathy Evaluation Scale (AES) had a sensitivity of 65% and a specificity of 63%, and the five-item WHO Well-being Index (WHO-5) had a sensitivity of 78% and a specificity of 83%. The Premenstrual Symptoms Screening Tool (PSST) was compared to a diagnosis of premenstrual dysphoric disorder and had a sensitivity of 27% and a specificity of 96%.
We found two anxiety-specific questionnaires. One study compared the seven-item Generalized Anxiety Disorder (GAD-7) to any anxiety disorder, with a sensitivity of 57% and a specificity of 53%, and one study compared the PHQ modules panic, with a sensitivity of 47% and a specificity of 96%, and GAD, with a sensitivity of 37% and a specificity of 96%, to corresponding DSM-IV criteria.
We found three instruments targeting combined anxiety/depression that were compared to a diagnosis of anxiety and/or depression. The HADS was evaluated in four studies. The sensitivity of the anxiety subscale ranged from 62% to 85%, and its specificity from 62% to 91%. The sensitivity range of the depression subscale was 54–90%, and specificity range 70–99%. One study evaluated the HSCL-25. The sensitivity of the anxiety subscale was 84%, and its specificity 59%. The sensitivity of the depression subscale was 82%, and its specificity 70%. The Primary Care Anxiety and Depression scale was evaluated in one study, which found a sensitivity of 82% and a specificity of 77%.
We found one instrument targeting PTSD symptoms. The Screen for Posttraumatic Stress Symptoms (SPTSS) had a sensitivity of 89% and a specificity of 89% compared to a PTSD diagnosis.
Lastly, we identified two general distress instruments that were compared to a diagnosis of any CMD. The 20-item Self-Reporting Questionnaire (SRQ-20) was investigated in six studies, of which one study also included a psychosis item. The sensitivity range was 71–100%; the specificity range 70–95%. The 12-item General Health Questionnaire (GHQ-12) was evaluated in one study and had a sensitivity of 83% and a specificity of 80%.
Online Supplementary Appendix 5 presents a visual representation for all instruments for which we included at least three studies.
Quality of studies
The QUADAS-2 results are evaluated at item-level and do not incorporate an overall quality score (Table 2). Eleven studies scored high risk of bias on one domain, 14 on two domains, three on three domains, and none on all four domains. Four studies did not score high risk on any of the domains.
Quality of studies rated as ☺ = low risk, ☹ = high risk, ? = unclear.
EPDS, Edinburgh Postnatal Depression Scale; PHQ-9, Patient Health Questionnaire; GDS-15, Geriatric Depression Scale; BDI-II, Beck Depression Inventory; CES-D, Center for Epidemiological Studies Depression Scale; MDI, Major Depression Inventory; AES, Apathy Evaluation Scale; WHO-5, WHO Well-being Index; PSST, Premenstrual Symptoms Screening Tool; GAD-7, Generalized Anxiety Disorder-7; HADS, Hamilton Anxiety and Depression Scale; HSCL-25, Hopkins Symptoms Checklist; PCAD, Primary Care Anxiety and Depression; SPTSS, Screen for Posttraumatic Stress Symptoms; SRQ-20, Self-Reporting Questionnaire; GHQ-12, General Health Questionnaire.
a Part of Arabic-speaking sub-sample completed test in English.
Risk of bias for Patient Selection was low in the majority of studies. Studies scored high risk if a case-control design was used (Fawzi et al., Reference Fawzi, Fawzi and Abu-Hindi2012) if participants were not recruited at random (Ghubash et al., Reference Ghubash, Daradkeh, Al Naseri, Al Bloushi and Al Daheri2000; Caspi et al., Reference Caspi, Carlson and Klein2007; Mahfoud et al., Reference Mahfoud, Kobeissi, Peters, Araya, Ghantous and Khoury2013), or in case of inappropriate exclusions (Al-Adawi et al., Reference Al-Adawi, Dorvlo, Burke, Huynh, Jacob, Knight, Shah and Al-Hussaini2004, Reference Al-Adawi, Dorvlo, Al-Naamani, Glenn, Karamouz, Chae, Zaidan and Burke2007; Alsuwaida and Alwahhabi, Reference Alsuwaida and Alwahhabi2006; Al-Asmi et al., Reference Al-Asmi, Dorvlo, Burke, Al-Adawi, Al-Zaabi, Al-Zadjali, Al-Sharbati, Al-Sharbati and Al-Adawi2012; Mahfoud et al., Reference Mahfoud, Kobeissi, Peters, Araya, Ghantous and Khoury2013, Reference Mahfoud, Emam, Anchassi, Omran, Alhaj, Al-Abdulla, El-Amin, Shehata, Aly, Al Emadi, Al-Meer and Al-Amin2019; Shaheen et al., Reference Shaheen, AlAtiq, Thomas, Alanazi, AlZahrani, Younis and Hussein2019). Risk was unclear in three studies, because the method of recruitment was unclear (Chaaya et al., Reference Chaaya, Sibai, El Roueiheb, Chemaitelly, Chahine, Al-Amin and Mahfoud2008; Sibai et al., Reference Sibai, Chaaya, Tohme, Mahfoud and Al-Amin2009; Hashim, Reference Hashim2018).
Studies were rated high risk for Index Test, because the questionnaire was completed after the reference standard and/or because the threshold was not pre-defined (El-Rufaie and Absood, Reference El-Rufaie and Absood1994, Reference El-Rufaie and Absood1995; El-Rufaie and Daradkeh, Reference El-Rufaie and Daradkeh1996; El-Rufaie et al., Reference El-Rufaie, Absood and Abou-Saleh1997; Al-Subaie et al., Reference Al-Subaie, Mohammed and Al-Malik1998, Reference Al-Arabi, Rahim, Al-Bar, AbuMadiny and Karim1999; Barnett et al., Reference Barnett, Matthey and Gyaneshwar1999; Ghubash et al., Reference Ghubash, Daradkeh, Al Naseri, Al Bloushi and Al Daheri2000; Agoub et al., Reference Agoub, Moussaoui and Battas2005; Alsuwaida and Alwahhabi, Reference Alsuwaida and Alwahhabi2006; Caspi et al., Reference Caspi, Carlson and Klein2007; Chaaya et al., Reference Chaaya, Sibai, El Roueiheb, Chemaitelly, Chahine, Al-Amin and Mahfoud2008; Sibai et al., Reference Sibai, Chaaya, Tohme, Mahfoud and Al-Amin2009; Al-Asmi et al., Reference Al-Asmi, Dorvlo, Burke, Al-Adawi, Al-Zaabi, Al-Zadjali, Al-Sharbati, Al-Sharbati and Al-Adawi2012; Fawzi et al., Reference Fawzi, Fawzi and Abu-Hindi2012; Mahfoud et al., Reference Mahfoud, Kobeissi, Peters, Araya, Ghantous and Khoury2013; El-Hachem et al., Reference El-Hachem, Rohayem, Bou, Richa, Kesrouani, Gemayel, Aouad, Hatab, Zaccak, Yaghi, Salameh and Attieh2014; Karam et al., Reference Karam, Khandakji, Sarkis Sahakian, Dandan and Karam2018; Naja et al., Reference Naja, Al-Kubaisi, Chehab, Al-Dahshan, Abuhashem and Bougmiza2019; Alzahrani et al., Reference Alzahrani, Demiroz, Alabdulwahab, Alshareef, Badri, Alharbi, Tawakkul and Aljaed2020).
Fourteen studies were rated high risk for Reference Test, because an unstructured clinician diagnosis rather than a semi-structured interview was used (El-Rufaie et al., Reference El-Rufaie, Absood and Abou-Saleh1997; Al-Subaie et al., Reference Al-Subaie, Mohammed and Al-Malik1998; Al-Arabi et al., Reference Al-Arabi, Rahim, Al-Bar, AbuMadiny and Karim1999; Chaaya et al., Reference Chaaya, Sibai, El Roueiheb, Chemaitelly, Chahine, Al-Amin and Mahfoud2008; Sibai et al., Reference Sibai, Chaaya, Tohme, Mahfoud and Al-Amin2009; El-Hachem et al., Reference El-Hachem, Rohayem, Bou, Richa, Kesrouani, Gemayel, Aouad, Hatab, Zaccak, Yaghi, Salameh and Attieh2014; Sawaya et al., Reference Sawaya, Atoui, Hamadeh, Zeinoun and Nahas2016; Hashim, Reference Hashim2018; Shaheen et al., Reference Shaheen, AlAtiq, Thomas, Alanazi, AlZahrani, Younis and Hussein2019), and/or because interviewers were not blinded (Agoub et al., Reference Agoub, Moussaoui and Battas2005; Ghubash et al., Reference Ghubash, Abou-Saleh and Daradkeh1997; Barnett et al., Reference Barnett, Matthey and Gyaneshwar1999). None of the studies reported interrater reliability.
In Flow and Timing, risk was high in eight studies, because of an inappropriate time interval between index and reference test (Ghubash et al., Reference Ghubash, Abou-Saleh and Daradkeh1997), and/or because not all participants were included in the analysis (El-Hachem et al., Reference El-Hachem, Rohayem, Bou, Richa, Kesrouani, Gemayel, Aouad, Hatab, Zaccak, Yaghi, Salameh and Attieh2014; Khalifa et al., Reference Khalifa, Glavin, Bjertness and Lien2015; Sawaya et al., Reference Sawaya, Atoui, Hamadeh, Zeinoun and Nahas2016; Karam et al., Reference Karam, Khandakji, Sarkis Sahakian, Dandan and Karam2018; Shaheen et al., Reference Shaheen, AlAtiq, Thomas, Alanazi, AlZahrani, Younis and Hussein2019).
Supplementary data and clarification were provided for three studies (Becker et al., Reference Becker, Al Zaid and Al Faris2002; Alsuwaida and Alwahhabi, Reference Alsuwaida and Alwahhabi2006; Al-Asmi et al., Reference Al-Asmi, Dorvlo, Burke, Al-Adawi, Al-Zaabi, Al-Zadjali, Al-Sharbati, Al-Sharbati and Al-Adawi2012) after correspondence with authors.
Results of the meta-analysis
We meta-analyzed studies (if at least three per questionnaire) reporting on the same questionnaire and comparable target condition. Optimal thresholds for the EPDS, HADS anxiety and depression subscales (HADS-A and HADS-D), and SRQ-20 could be estimated. Two studies on the SRQ-20 were excluded from meta-analysis, because of missing data to calculate the 2 × 2 table (El-Rufaie and Absood, Reference El-Rufaie and Absood1994), and because a 21-item version was used (Alsuwaida and Alwahhabi, Reference Alsuwaida and Alwahhabi2006). We also performed meta-analysis on the GDS-15, but results were unreliable due to limited data and therefore only presented in online Supplementary Appendix 6. Pooled AUC statistics were >0.80 for all questionnaires. The summary operating points per questionnaire at different thresholds are provided in Table 3 and visually presented in summary ROC (SROC) plots in Fig. 2. We also included the Youden index and ROC/SROC curves, and 2 × 2 tables in online Supplementary Appendix 6.
EPDS, Edinburgh Postnatal Depression Scale; SRQ-20, Self-Reporting Questionnaire.
a We reported the 95% CI of the AUC for sensitivity given specificity.
b The model estimated an optimal threshold for the EPDS of 11.08 (sensitivity = 76.5% and specificity = 85.5%).
c The model estimated an optimal threshold for the HADS-A of 7.17 (sensitivity = 70.3% and specificity = 80.1%).
d The model estimated an optimal threshold for the HADS-D of 5.97 (sensitivity = 73.2% and specificity = 88.4%).
e The model estimated an optimal threshold for the SRQ-20 of 8.36 (sensitivity = 86.0% and specificity = 83.9%).
Bold values signifies the best cut-off.
Our model identified 11.08 as optimal threshold for the EPDS (n = 7); resulting in a practically relevant optimal cut-off score of 11, with a pooled sensitivity of 76.9% (95% confidence interval [CI] 60.6–87.7) and a specificity of 85.2% (95% CI 78.4–90.1).
The HADS-A model (n = 4) identified 7.17 as an optimal threshold, indicating a practically relevant cut-off score of 7 with a pooled sensitivity of 71.9% (95% CI 41.9–90.1) and a specificity of 78.5% (95% CI 67.3–86.6). The HADS-D model (n = 4) identified 5.97 as an optimal threshold, with 6 as the closest, practically relevant cut-off score, having a pooled sensitivity of 73.0% (95% CI 48.9–88.4) and a specificity of 88.6% (95% CI 75.7–95.1). CIs for the sensitivity/specificity estimates of the HADS subscales were wide, also illustrated by widely varying ROC curves in Fig. 2, indicating low discriminative ability.
Finally, the SRQ-20 model (n = 4) identified 8.36 as an optimal threshold, indicating a practically relevant cut-off score of 8 with a pooled sensitivity of 86.0% (95% CI 78.0–91.4) and a specificity of 83.9% (95% CI 58.1–95.1). The questionnaire's CIs associated with the pooled specificity were particularly wide.
Discussion
Brief psychological screening instruments are commonly used in research and clinical practice for the measurement of symptom severity, but also as inexpensive, easy-to-administer tools for case-finding (Kagee et al., Reference Kagee, Tsai, Lund and Tomlinson2013; Olin et al., Reference Olin, Mccord, Kerker, Weiss, Hoagwood and Horwitz2017). This systematic review and meta-analysis investigated the diagnostic performance of brief, Arabic-language screening instruments in detecting the symptoms of CMDs.
We synthesized the current evidence of 17 questionnaires, including instruments targeting depression, anxiety, general distress, and PTSD. A first finding is that, while the majority of studies reported on depression-specific questionnaires, the evidence for PTSD-specific instruments is limited. We must note, however, that we excluded several papers on the validity of PTSD screening tools in mixed-language populations (Söndergaard et al., Reference Söndergaard, Ekblad and Theorell2003; Jakobsen et al., Reference Jakobsen, Thoresen and Johansen2011; Ibrahim et al., Reference Ibrahim, Ertl, Catani, Ismail and Neuner2018), since they did not separately report data on Arabic-speaking sub-samples. Another general finding is that we did not identify locally developed screening tools, and this review only synthesized evidence on Arabic translations of screeners originally developed in other settings.
The studies included in this review differed in many ways from each other. Studies varied with regard to target condition (e.g. major depressive disorder v. any mood disorder), population (e.g. pregnant women v. elderly), and setting (e.g. clinical sample in Sudan v. community sample in Lebanon). Although this review focused on Arabic-speaking populations, the global Arabic-speaking community cannot be considered as one monolithic cultural group with identical idioms of distress or manifestations of psychological distress (e.g. Hassan et al., Reference Hassan, Ventevogel, Jefee-Bahloul, Barkil-Oteo and Kirmayer2016). Modern Standard Arabic (formal Arabic) is the only standardized form of written Arabic and is commonly understood among Arabic-speakers. Questionnaires in written form should thus be applicable across Arabic-speaking populations. However, in the majority of studies, questionnaires were administered by an interviewer, and thus read aloud. Even if questionnaires were written in formal Arabic, interviewers and participants may have communicated (or clarified) using their local dialects. Furthermore, most screening instruments were locally translated, and this might have introduced minor linguistic differences between translations. All but one study were conducted in Arabic countries, and covered Arabic-speaking populations in both high-income countries (e.g. Saudi Arabia) and LAMICs (e.g. Egypt).
Meta-analytic evidence was provided for the EPDS, HADS, and SRQ-20. Although AUCs were high, this statistic summarizes overall model performance over all possible thresholds. In practice, however, a specific threshold is used to discriminate between cases and non-cases, and determines the number of false-negative and false-positive cases. Thus, a single cut-off score may not perform as good as expected by overall test performance.
The present review found that a cut-off of 11 on the EPDS maximized combined sensitivity (76.9%)/specificity (85.2%). This threshold is lower compared to the original cut-off of 13 in English-speaking populations (Cox et al., Reference Cox, Holden and Sagovsky1987). A recent meta-analysis of individual participant data (IPDMA) on the EPDS also found that a threshold of 11 maximized combined sensitivity (81%)/specificity (88%) (Levis et al., Reference Levis, Negeri, Sun, Benedetti and Thombs2020). Earlier reviews found the EPDS to be valid for non-English-speaking populations (Zubaran et al., Reference Zubaran, Schumacher, Roxo and Foresti2010; Russell et al., Reference Russell, Chikkala, Earnest, Viswanathan, Russell and Mammen2020). The EPDS is one of the most frequently studied instruments in perinatal populations in LAMICs (Chorwe-Sungani and Chipps, Reference Chorwe-Sungani and Chipps2017). Ali et al. (Reference Ali, Ryan and DeSilva2016) conclude that the instrument generally performs well in LAMICs, while a systematic review in low- and lower-middle income countries, without Arabic-speaking samples, found that none of the studies had an accuracy of >80% on all three accuracy parameters (sensitivity/specificity/PPV) (Shrestha et al., Reference Shrestha, Pradhan, Tran, Gualano and Fisher2016). The optimal cut-off score in our meta-analysis would miss almost a quarter of individuals with depression. Clinicians may therefore consider using a lower cut-off to identify potential cases for the purpose of triage (e.g. positive cases will be further assessed with a clinical interview). For example, a cut-off score of 9 would miss 15.6% of individuals with depression, but at the cost of screening 26.2% of non-cases as cases. However, in low-resourced settings where there is no capacity to assess all positive cases with a clinical interview, a high number of false positives (resulting from low specificity), is likely to overburden local health systems (Andersen et al., Reference Andersen, Joska, Magidson, O'Cleirigh, Lee, Kagee, Witten and Safren2020). In these settings, a higher cut-off with improved specificity might be preferable.
We found substantial heterogeneity in the test performance of the HADS. A cut-off of 7 was optimal for the HADS-A based on maximized combined sensitivity (71.9%) and specificity (78.5%), and of 6 for the HADS-D (sensitivity: 73.0%/specificity: 88.6%). CIs for HADS were wide, indicating uncertainty about the estimated psychometric properties. In a recent IPDMA on the accuracy of the HADS-D to estimate depression prevalence, Brehaut et al. (Reference Brehaut, Neupane, Levis, Wu, Sun, Krishnan, He, Bhandari, Negeri, Riehm, Rice, Azar, Yan, Imran, Chiovitti, Saadat, Cuijpers, Ioannidis, Markham, Patten, Ziegelstein, Henry, Ismail, Loiselle, Mitchell, Tonelli, Boruff, Kloda, Beraldi, Braeken, Carter, Clover, Conroy, Cukor, da Rocha e Silva, De Souza, Downing, Feinstein, Ferentinos, Fischer, Flint, Fujimori, Gallagher, Goebel, Jetté, Julião, Keller, Kjærgaard, Love, Löwe, Martin-Santos, Michopoulos, Navines, O'Rourke, Öztürk, Pintor, Ponsford, Rooney, Sánchez-González, Schwarzbold, Sharpe, Simard, Singer, Stone, Tung, Turner, Walker, Walterfang, White, Benedetti and Thombs2020) found the commonly used cut-off of 8 (‘doubtful cases’) significantly overestimated depression prevalence, while a cut-off of 11 (‘definite cases’) may either over- or underestimate depression prevalence. Ali et al. (Reference Ali, Ryan and DeSilva2016) conclude that the HADS-A is an adequate screener in LAMICs, but reported strong to very strong validity for primary studies that used the English (with Yoruba) version of HADS-A, and weak to strong validity for other language versions (Portuguese and Chinese) (Ali et al., Reference Ali, Ryan and DeSilva2016). Based on our meta-analyses and in line with Brehaut et al. (Reference Brehaut, Neupane, Levis, Wu, Sun, Krishnan, He, Bhandari, Negeri, Riehm, Rice, Azar, Yan, Imran, Chiovitti, Saadat, Cuijpers, Ioannidis, Markham, Patten, Ziegelstein, Henry, Ismail, Loiselle, Mitchell, Tonelli, Boruff, Kloda, Beraldi, Braeken, Carter, Clover, Conroy, Cukor, da Rocha e Silva, De Souza, Downing, Feinstein, Ferentinos, Fischer, Flint, Fujimori, Gallagher, Goebel, Jetté, Julião, Keller, Kjærgaard, Love, Löwe, Martin-Santos, Michopoulos, Navines, O'Rourke, Öztürk, Pintor, Ponsford, Rooney, Sánchez-González, Schwarzbold, Sharpe, Simard, Singer, Stone, Tung, Turner, Walker, Walterfang, White, Benedetti and Thombs2020), the evidence for the validity of the Arabic HADS is questionable.
The SRQ-20 as a screener for CMDs maximized combined sensitivity/specificity at a cut-off of 8 (86.0% and 83.9%, respectively). In other words, 14% of individuals with a disorder will remain undetected, while 16.1% of individuals without a disorder screen positive. The CIs for specificity were relatively wide. We therefore suggest that the SRQ-20 cut-off of 8 is useful for screening purposes to rule out the presence of any CMD, but that the questionnaire might be less reliable for ruling in because of uncertainty about the pooled specificity. A cut-off of 8 is commonly used (Harpham et al., Reference Harpham, Reichenheim, Oser, Thomas, Hamid, Jaswal, Ludermir and Aidoo2003), although prior research has shown that optimal thresholds for the SRQ-20 differ considerably across settings, languages, cultures, and gender (e.g. Harding et al. Reference Harding, De Arango, Baltazar, Climent, Ibrahim, Ladrido-Ignacio and Wig1980; Ventevogel et al. Reference Ventevogel, De Vries, Scholte, Shinwari, Faiz, Nassery, van den Brink, van den Brink and Olff2007). For example, a cut-off score of 6 gave the best sensitivity/specificity balance in two studies in low-resource primary care settings in Eritrea and South Africa. Both studies also found that performance improved among men by using an even lower cut-off (Van der Westhuizen et al., Reference Van der Westhuizen, Wyatt, Williams, Stein and Sorsdahl2017; Netsereab et al., Reference Netsereab, Kifle, Tesfagiorgis, Habteab, Weldeabzgi and Tesfamariam2018).
This review has several strengths and limitations. A strength is that it provides researchers and clinicians working with Arabic-speaking populations with an overview of the validity of brief screening tools, and empirically grounded recommendations for thresholds. We provided the results of multi-threshold models, rather than bivariate models in which only one threshold per study can be pooled. In doing so, we were able to provide the pooled accuracy statistics at different cut-off scores, allowing researchers and clinicians to decide which threshold is most suitable (e.g. for epidemiological studies v. screening in stepped care).
A limitation of this paper concerns the wide range of reference standards used, including both (semi-)structured interviews and (unstructured) clinician diagnoses. Clinician diagnoses may be less reliable than (semi-)structured interviews (Segal and Williams, Reference Segal, Williams, Beidel, Frueh and Hersen2014). The literature, however, also highlights the limitations of structured interviews. For example, the MINI may overestimate the presence of mental disorders (Levis et al., Reference Levis, Benedetti, Riehm, Saadat, Levis, Azar, Rice, Chiovitti, Sanchez, Cuijpers, Gilbody, Ioannidis, Kloda, McMillan, Patten, Shrier, Steele, Ziegelstein, Akena, Arroll, Ayalon, Baradaran, Baron, Beraldi, Bombardier, Butterworth, Carter, Chagas, Chan, Cholera, Chowdhary, Clover, Conwell, de Man-van Ginkel, Delgadillo, Fann, Fischer, Fischler, Fung, Gelaye, Goodyear-Smith, Greeno, Hall, Hambridge, Harrison, Hegerl, Hides, Hobfoll, Hudson, Hyphantis, Inagaki, Ismail, Jetté, Khamseh, Kiely, Lamers, Liu, Lotrakul, Loureiro, Löwe, Marsh, McGuire, Sidik, Munhoz, Muramatsu, Osório, Patel, Pence, Persoons, Picardi, Rooney, Santos, Shaaban, Sidebottom, Simning, Stafford, Sung, Tan, Turner, van der Feltz-Cornelis, van Weert, Vöhringer, White, Whooley, Winkley, Yamada, Zhang and Thombs2018; Wu et al., Reference Wu, Levis, Sun, Krishnan, He, Riehm, Rice, Azar, Yan, Neupane, Bhandari, Imran, Chiovitti, Saadat, Boruff, Cuijpers, Gilbody, McMillan, Ioannidis, Kloda, Patten, Shrier, Ziegelstein, Henry, Ismail, Loiselle, Mitchell, Tonelli, Al-Adawi, Beraldi, Braeken, Büel-Drabe, Bunevicius, Carter, Chen, Cheung, Clover, Conroy, Cukor, da Rocha e Silva, Dabscheck, Daray, Douven, Downing, Feinstein, Ferentinos, Fischer, Flint, Fujimori, Gallagher, Gandy, Goebel, Grassi, Härter, Jenewein, Jetté, Julião, Kim, Kim, Kjærgaard, Köhler, Loosman, Löwe, Martin-Santos, Massardo, Matsuoka, Mehnert, Michopoulos, Misery, Navines, O'Donnell, Öztürk, Peceliuniene, Pintor, Ponsford, Quinn, Reme, Reuter, Rooney, Sánchez-González, Schwarzbold, Cankorur, Shaaban, Sharpe, Sharpe, Simard, Singer, Stafford, Stone, Sultan, Teixeira, Tiringer, Turner, Walker, Walterfang, Wang, White, Wong, Benedetti and Thombs2020). Another limitation is related to the quality of studies, with 17 studies scoring high risk of bias on at least two QUADAS-2 domains. The majority of studies did not pre-specify a cut-off score, which may lead to overestimation of the accuracy estimates (Whiting et al., Reference Whiting, Rutjes, Westwood, Mallett, Deeks, Reitsma, Leeflang, Sterne and Bossuyt2011). Furthermore, for some questionnaires, primary studies differed with respect to target condition and reported thresholds, due to which we could not meta-analyze those studies (e.g. PHQ-9). Due to low numbers of studies per questionnaire, we could not perform further subgroup analyses. Consequently, we included both antenatal and postnatal, as well as female-only and male-only samples in our meta-analysis on the EPDS, while these sub-samples may require different thresholds (Matthey et al., Reference Matthey, Henshaw, Elliott and Barnett2006; Gibson et al., Reference Gibson, McKenzie-Mcharg, Shakespeare, Price and Gray2009; Ali et al., Reference Ali, Ryan and DeSilva2016). We were also not able to investigate differences across Arabic-speaking populations (e.g. by country).
The clinical implications of this review are that a cut-off of 11 on the Arabic-language EPDS could be used as a screener for depression in perinatal populations to optimize a balance between sensitivity/specificity. For ruling out the presence of any CMD with the SRQ-20, we recommend using a cut-off score of 8. The evidence for the HADS to screen for depression and/or anxiety was not convincing as results were substantially heterogeneous.
This review also stresses the paucity of evidence on anxiety and PTSD screeners. Future studies are needed to investigate the diagnostic accuracy of questionnaires to detect anxiety and PTSD in Arabic-speaking populations given the amount of Arabic-speaking refugees at risk for developing stress-related disorders (Peconga and Høgh Thøgersen, Reference Peconga and Høgh Thøgersen2020). According to our QUADAS-2 assessment, future studies can be improved by using semi-structured interviews as reference standard, such as the SCID, and report on the interrater reliability. We recommend pre-defining thresholds to prevent the overestimation of accuracy estimates.
Conclusions
This review identified 17 brief questionnaires in the Arabic language that were investigated on diagnostic performance, with limited availability of evidence for PTSD instruments. The meta-analysis provided optimal cut-off scores for the EPDS, HADS, and SRQ-20.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/gmh.2021.39
Data
The data that support the findings of the meta-analysis are available in the online Supplementary material of this article.
Financial support
This review was funded by ZonMw (The Netherlands Organisation for Health Research and Development; project number 636601004) and Horizon 2020 the Framework Programme for Research and Innovation (2014–2020). The content of this article reflects only the authors’ views and the European Community is not liable for any use that may be made of the information contained therein.
Conflict of interest
None.