Heterogeneity is a key feature of almost all psychiatric disorders (Kapur, Phillips, & Insel, 2012; Kendell & Jablensky, 2003). Psychiatric patients usually present with a wide variety of clinical features [e.g. symptom patterns or treatment response (Georgiades, Szatmari, & Boyle, 2013; Kofler et al., 2017; Monroe & Anderson, 2015; Picardi et al., 2012; Volavka & Citrome, 2009)], and different underlying biological disturbances may be at play in patients with the same diagnosis (Ozomaro, Wahlestedt, & Nemeroff, 2013). Identification of more homogeneous diagnostic (sub)groups within larger diagnostic groups (e.g. depression, developmental disorders, psychosis) is often proposed as a starting point for increasing our understanding of more patient-specific etiological mechanisms and, thus, for advancing the development of more biologically informed, patient-specific diagnoses and personalized treatment (e Silva, 2013; Kapur et al., 2012; Ozomaro et al., 2013).
Identification of psychiatric diagnoses and subtypes has traditionally been based on clinical judgement and consensus (Kendler, 2009). Data-driven cluster analyses can be used to further reduce psychopathological heterogeneity by identifying patterns in data that are missed by clinical pattern recognition (Marquand, Wolfers, Mennes, Buitelaar, & Beckmann, 2016). Although the call to apply data-driven approaches to psychiatric disease classification has been around for decades (Kendell, 1989), their popularity has risen notably in recent years (Beijers, Wardenaar, van Loo, & Schoevers, 2019b; Librenza-Garcia et al., 2017; Lombardo, Lai, & Baron-Cohen, 2019; Marquand et al., 2016; Schnack, 2017; Van Loo, De Jonge, Romeijn, Kessler, & Schoevers, 2012). This is likely due to a combination of factors, including the availability of suitable datasets, increased computational capabilities and ongoing advances in the fields of statistics and machine learning that make it possible to extract information from complex and high-dimensional data (Ahmad & Fröhlich, 2016; Lin & Hsien-Yuan, 2017; Marquand et al., 2016).
Data-driven clustering techniques have been used to gather evidence about possible subtypes in a broad range of psychiatric patient populations, including depression (Beijers et al., 2019b; Van Loo et al., 2012), psychosis (Chand et al., 2019; Lewandowski, Baker, McCarthy, Norris, & Öngür, 2018; Reser, Allott, Killackey, Farhall, & Cotton, 2015), bipolar disorder (Librenza-Garcia et al., 2017) and developmental disorders [e.g. attention deficit hyperactivity disorder (Mostert et al., 2018), autism spectrum disorder (Lombardo et al., 2019)].
The predominant approach in psychiatry has been unsupervised learning in the form of finite mixture models (FMMs) and clustering algorithms (e.g. k-means clustering, hierarchical clustering, and community detection) (Marquand et al., 2016). Unsupervised methods have been widely used for discovering subtypes within clinical groups because supervised learning, which aims to correctly predict subject labels (e.g. patients v. healthy controls), is fundamentally limited by the quality of the clinical labels and cannot be used to investigate the validity of these labels (Wolfers, Buitelaar, Beckmann, Franke, & Marquand, 2015). Unsupervised learning does not use labels but instead attempts to find subgroups based on the structure of the data and the heuristics used by each algorithm. Although the use of data-driven clustering techniques seems promising, there is also reason for caution. Scientific results are known to not always be robust, and the specifics of a chosen analytical method can have a considerable influence on research outcomes (Simmons, Nelson, & Simonsohn, 2011; Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016; Silberzahn et al., 2018). In the case of cluster analyses, however, there is usually no way of knowing whether the results of a presented analysis would have been the same if different model specifications had been used, as researchers generally perform only one or two separate analyses (Marquand et al., 2016). Better insight into the effects of model specifications on unsupervised clustering results could greatly improve our understanding of data-driven psychiatric subtyping.
In addition, it could provide leads for data-driven subtypes of major depressive disorder (MDD) by identifying patterns that are robust to methodological variation.
In unsupervised learning, analytical variation across studies is a realistic risk because of the wide range of available model specifications for unsupervised learning algorithms. This is likely due to the lack of a straightforward way to judge the quality of unsupervised learning results: there is no outcome measure, unlike in supervised learning, which either succeeds or fails at predicting a predefined outcome (Hastie, Tibshirani, & Friedman, 2011). We decided to focus on k-means and hierarchical clustering because these have been shown to be the most commonly used methods across disorders (Marquand et al., 2016), and FMMs have previously been shown to have a number of issues (Borsboom et al., 2016; Hagenaars, 1988; van Loo, Wanders, Wardenaar, & Fried, 2016). Within k-means/hierarchical clustering, there are three main aspects of the method that can vary: (1) the algorithm, (2) the distance metric (used to determine dissimilarity between data points) and (3) the fit index (used to decide the optimal number of clusters). When investigating the 13 studies mentioned by Marquand et al. (2016), we found that k-means clustering was used most often, but that a specific rationale or justification for this choice was generally not given (8/13).
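To make concrete how quickly these three varying aspects multiply, the following sketch (our illustration, not the authors' code; the specific algorithm, distance and index names are assumptions chosen for the example) enumerates a small specification grid:

```python
# Illustrative sketch: the three varying aspects of a k-means/hierarchical
# clustering analysis -- (1) algorithm, (2) distance metric, (3) fit index --
# define a grid of model specifications. The option lists below are a small
# hypothetical subset, not the full set used in the study.
from itertools import product

algorithms = ["kmeans", "single", "complete", "average", "ward"]
distances = ["euclidean", "manhattan", "correlation"]
fit_indices = ["silhouette", "calinski_harabasz", "davies_bouldin"]

# Each valid combination is one "specification"; k-means is conventionally
# paired with Euclidean distance, so incompatible pairs are dropped.
specs = [
    (alg, dist, idx)
    for alg, dist, idx in product(algorithms, distances, fit_indices)
    if not (alg == "kmeans" and dist != "euclidean")
]
print(len(specs))  # 39 specifications from even this small grid
```

Even with a handful of options per aspect, dozens of defensible analyses exist, which is why reporting only one or two can be misleading.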
This is likely because, given the aforementioned lack of a gold standard, we must rely on simulation studies to evaluate algorithms (Clifford, Wessely, Pendurthi, & Emes, 2011; Ferreira & Hitchcock, 2009; Hands & Everitt, 1987; Saraçli, Doǧan, & Doǧan, 2013), distance metrics (Clifford et al., 2011; Saraçli et al., 2013) and fit indices (Islam, Alizadeh, van den Heuvel, & GROUP investigators, 2015; Milligan & Cooper, 1985). Such studies are performed only rarely and generally yield mixed results (Clifford et al., 2011; Ferreira & Hitchcock, 2009; Hands & Everitt, 1987; Islam et al., 2015; Milligan & Cooper, 1985; Saraçli et al., 2013).
The current study aimed to identify clusters in a psychiatric sample and to gain insight into the effects of different model specifications on the results by applying a Specification-Curve Analysis [SCA (Simonsohn, Simmons, & Nelson, 2020)] to a selected group of unsupervised machine learning algorithms (k-means clustering and six hierarchical clustering algorithms). SCA was developed to investigate the effects of methodological variations on regression results in psychology, but it can also be applied to study the effect of different model specifications in unsupervised machine learning analyses. When applied to the current case of cluster analysis, SCA considers the results of a large range of model specifications jointly, instead of relying on cluster analysis with just one or two model specifications. Because SCA has never before been applied to cluster analysis, we also investigated the influence of data properties, such as the true number of clusters in the data and varying levels of noise, on the SCA outcomes.
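A minimal sketch of the SCA idea applied to clustering may help fix intuitions. This illustrates the general workflow only (running every specification on the same data and inspecting the distribution of optima jointly); it is an assumption about the shape of such an analysis, not the authors' implementation:

```python
# Minimal SCA-style sketch: each specification (linkage method x fit index)
# is run on the same data, its optimal number of clusters is recorded, and
# the tally across all specifications is inspected jointly.
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import calinski_harabasz_score, silhouette_score

rng = np.random.default_rng(0)
# Toy data with two clearly separated clusters in 5 dimensions.
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(4, 1, (50, 5))])

methods = ["single", "complete", "average", "ward"]
indices = {"silhouette": silhouette_score,
           "calinski_harabasz": calinski_harabasz_score}

optimal_k = Counter()
for method in methods:
    Z = linkage(X, method=method)
    for name, index in indices.items():
        # Higher is better for both indices used here.
        scores = {k: index(X, fcluster(Z, k, criterion="maxclust"))
                  for k in range(2, 7)}
        optimal_k[max(scores, key=scores.get)] += 1  # this spec's optimum

print(optimal_k)  # with well-separated clusters, k=2 should dominate
```

In the SCA framing, the question is not which single specification is "right" but whether the same answer keeps emerging across the whole curve of specifications.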
For this study, we focused on the identification of biological, proteomics-based subtypes of MDD. There have been increasing efforts to identify homogeneous clusters of MDD patients, mainly based on clinical data. The results of these studies tend to be unstable or to find subtypes mainly based on severity (Van Loo et al., 2012). Fewer efforts have been based on biological measures (Beijers et al., 2019b). There are some indications that biology-based clustering suffers from a similar degree of variation, likely due (at least in part) to the large variability in the methodology used (Beijers et al., 2019b). In this study, we investigated whether proteomics-based subtypes are indeed sensitive to different model specifications, or whether robust subtypes could be found using proteomics data. Our specific aims were (1) to evaluate the influence of model specifications on the number of identified data-driven biological clusters in MDD, (2) to investigate whether SCA identifies clusters with distinct biological patterns that are robust to variations in model specifications, and (3) to run simulations to investigate how data properties influence SCA cluster results.
We investigated the presence of data-driven biological clusters of depression and evaluated the effect of different model specifications on these findings. The cluster-analysis results based on our sample of MDD patients were very sensitive to the model specifications used. The SCA showed that the number of identified clusters was inconsistent and that cluster allocation stability was low. Together, these observations indicated no robust cluster structure in the real dataset. This was also the case for the sample including healthy controls. Moreover, our analyses showed that many specifications will produce a cluster solution even when no structure is present in the data. The simulation study showed that SCA can correctly identify clusters as the most consistent solution if they are present in the data, but that this becomes more difficult with larger numbers of clusters and/or higher noise levels. Below, the implications of these results are discussed.
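One common way to quantify "cluster allocation stability" across specifications is the pairwise Adjusted Rand Index (ARI) between the partitions that different specifications produce; the following is a hedged sketch of that idea (our illustration; the paper's exact stability measure may differ, and the specifications shown here are a hypothetical subset):

```python
# Sketch: pairwise Adjusted Rand Index (ARI) between the partitions produced
# by different specifications. ARI near 1 = stable allocations across
# specifications; near 0 = chance-level agreement.
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))  # structureless data: no true clusters

# Partitions from a few specifications (here: k-means with varying k/seed).
partitions = [KMeans(n_clusters=k, n_init=10, random_state=s).fit_predict(X)
              for k in (2, 3, 4) for s in (0, 1)]

# Mean pairwise ARI across the specification set.
aris = [adjusted_rand_score(a, b) for a, b in combinations(partitions, 2)]
print(round(float(np.mean(aris)), 3))
```

Note that every specification above still returns a partition, illustrating the point that clustering algorithms produce a solution whether or not structure exists; stability across specifications is what distinguishes the two cases.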
As discussed in the introduction, the variability in the results of previous cluster analyses raises inevitable questions about how much confidence we should place in the results of a single cluster analysis, especially when this single analysis lacks replication in independent samples and clinical validation (e.g. differences in risk factors or course) (Beijers et al., 2019b; Marquand et al., 2016; Van Loo et al., 2012). Our study aimed to investigate whether confidence in model results improves when SCA is applied. The simulation results are somewhat encouraging, but the lack of a robust cluster structure in the real datasets, including the one with both MDD patients and healthy controls, raises several concerns. How can we explain that the NESDA study found differences in biomarkers between cases and controls, but we do not find them in cluster analyses using the same biomarkers? Should these results call into question the applicability of clustering techniques to biological data and therefore caution against any future use of such techniques?
It is possible that we did not find clusters in the real dataset because of technical issues. It could be, for instance, that the differences between cases and controls are too small to be picked up by cluster analysis, that there is insufficient correlation between the biomarkers, or that the signal-to-noise ratio is too low for cluster detection.
Alternatively, the fact that the SCA was not able to distinguish between MDD patients and controls could indicate that the DSM categories cannot be validated using this specific type of biological data. Some, but not all, of the biomarkers used have previously been shown to be associated with depression. For example, macrophage migration inhibitory factor, a pleiotropic cytokine, has been shown to be higher in MDD patients than in controls in five out of six studies (Bloom & Al-Abed, 2014). Interleukin-1 receptor antagonist has also been shown to be increased in patients compared with controls (Maes et al., 1997; Milaneschi et al., 2009). The von Willebrand factor, a marker involved in hemostasis, was previously found to be increased in one study (Domenici et al., 2010), which is supported by earlier genetic findings of an association between depressive symptoms and a specific von Willebrand allele in cardiac patients (McCaffery et al., 2009). Pancreatic polypeptide, which was elevated in patients, has been linked to anorexia nervosa (Batterham et al., 2003), and another member of the pancreatic polypeptide family, peptide YY, was (marginally) positively related to depressive symptoms in older adults (Powell et al., 2014). The other individual markers identified by Bot et al. (2015) were not associated with MDD in previous studies, or have not previously been investigated.
For instance, the lower levels of growth-regulated alpha protein contrast with a study that found higher levels, although that result was not significant in the validation cohort (Powell et al., 2014).
The simulation results indicated that it is difficult to identify stable/robust clusters, even when they do in fact exist, as they showed the analyses' sensitivity to data complexity (i.e. the number of clusters), increased noise and/or the presence/number of outliers. This is also the case for analyses based on single-specification simulations (Hands & Everitt, 1987). In some cases (i.e. low numbers of clusters, little noise), it is likely still possible to identify any robust clusters present with SCA. In that case, the results should be considered much more reliable than those of a single analysis, because the former are robust to differences in model specifications. This has already been shown in social psychology, where, for example, the negative impact of racial bias on callback rates in job application processes has been shown to be robust, whereas the increased death toll of female-named hurricanes was not (Simonsohn et al., 2020).
Our study should be considered in light of the following limitations. First, we used 31 biomarkers that were previously shown to differ between patients with current MDD and healthy controls using adjusted linear regression (Bot et al., 2015). It is possible that other biochemical markers are more suitable for finding clusters of MDD patients. Currently, it is unknown which measures are best suited for biological subtyping of depression (Beijers et al., 2019b), so measures of brain structure or functional connectivity (Hasler & Northoff, 2011) or genetic background (Flint & Kendler, 2014) might be more suitable for clustering MDD patients. Furthermore, it could be that inter-personal variations in psychiatric samples are better captured by continuous distributions [e.g. severity dimension(s)] than by discrete clusters (Islam et al., 2018; van Loo et al., 2016; Wanders et al., 2016; Wardenaar, Wanders, ten Have, de Graaf, & de Jonge, 2017).
Second, SCA has traditionally been used in psychology to investigate the effects of using alternative regression models (Orben & Przybylski, 2019; Rohrer, Egloff, & Schmukle, 2017; Simonsohn et al., 2020). Cluster techniques are more complex: two three-cluster solutions may be completely different in size and subject allocation, whereas a two- and a three-cluster solution may be partially overlapping. It is therefore important to keep in mind that this application of SCA focuses mainly on the resulting number of clusters and cluster stability, rather than on the substantive interpretation of the clusters. Had we found an optimal number of clusters (K optimal) with a stable model solution, we would have investigated whether the movement of subjects between models with K optimal − 1 and K optimal clusters was stable. If that had been the case, we would have investigated the movement of subjects between models with ever-decreasing K, in order to determine whether a stable division tree could be constructed all the way from K = 1 to K optimal.
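Tracking the movement of subjects between a K-cluster and a (K − 1)-cluster solution can be done with a simple contingency table of the two partitions; a clean split (each K-cluster nested in exactly one (K − 1)-cluster) indicates a stable division tree. The following is a hypothetical sketch of that check, not the procedure the authors would have implemented:

```python
# Sketch: cross-tabulate a K=2 and a K=3 partition of the same subjects.
# A clean split leaves each K=3 cluster nested in exactly one K=2 cluster
# (one nonzero cell per column of the table).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(2)
# Toy data: three separated groups of 40 subjects in 3 dimensions.
X = np.vstack([rng.normal(m, 0.5, (40, 3)) for m in (0, 3, 6)])

Z = linkage(X, method="ward")
labels_k2 = fcluster(Z, 2, criterion="maxclust")  # labels in 1..2
labels_k3 = fcluster(Z, 3, criterion="maxclust")  # labels in 1..3

table = np.zeros((2, 3), dtype=int)  # rows: K=2 clusters; cols: K=3 clusters
for a, b in zip(labels_k2, labels_k3):
    table[a - 1, b - 1] += 1

nested = all((table[:, j] > 0).sum() == 1 for j in range(3))
print(table)
print(nested)  # cuts of one hierarchical tree are nested -> True
```

For cuts of a single hierarchical tree nestedness holds by construction; the interesting case is comparing solutions from different specifications, where a non-nested table signals reshuffling rather than a stable division tree.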
Third, we used a limited number of model specifications for unsupervised learning. We focused on k-means clustering and hierarchical clustering because these are among the most commonly used methods across disorders (Marquand et al., 2016), and FMMs have been shown to have a number of issues that limit their usefulness for psychiatric classification: FMMs tend to detect groups with different severity levels, which is not always the aim of cluster analysis, and local dependence between variables can obfuscate the results (Borsboom et al., 2016; Hagenaars, 1988; van Loo et al., 2016). Because there is insufficient evidence on which clustering algorithms, distance metrics and fit indices are most useful for a study like ours, we decided to study all of the potential model specifications and not to exclude any a priori. We used the exhaustive list of options in the NbClust R-package, which was designed to gather the indices available across SAS and R packages into a single package, together with some newer indices that are not yet implemented elsewhere (Charrad et al., 2014).
Fourth, we did not perform a Monte Carlo SCA but rather used SCA to evaluate the results obtained in a single simulation study. There is no Monte Carlo element in our procedure, as we did not seek to quantify the clustering quality of SCA or of a single specification per se. Rather, our simulations aimed to evaluate whether, in the presence of a known number of clusters in a population, SCA can robustly recover this number across different model specifications. Therefore, we used simulated datasets to illustrate the use of SCA under different circumstances (different numbers of clusters, noise levels). In total, we simulated only 15 datasets (i.e. 2, 5 and 10 clusters, each with 5 different noise levels). We chose to simulate different noise levels by increasing the number of outliers (Saraçli et al., 2013), varying the number of informative variables (Clifford et al., 2011) and varying the degree of separation between the clusters (i.e. increasing variance) (Clifford et al., 2011; Ferreira & Hitchcock, 2009; Milligan, 1980), but other methods of simulating noisy datasets also exist (Milligan, 1980).
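The three ways of injecting noise can be sketched in a single generator. This is an illustrative sketch under assumed details (parameter names, distributions and defaults are ours, not the authors' simulation code):

```python
# Sketch of simulating clustered data with three noise mechanisms:
# (1) outliers, (2) uninformative (noise) variables, (3) reduced separation
# between cluster centres (i.e. increased relative variance).
import numpy as np

def simulate(n_clusters=2, n_per_cluster=100, n_informative=5,
             n_noise_vars=0, separation=4.0, outlier_frac=0.0, seed=0):
    rng = np.random.default_rng(seed)
    blocks = []
    for c in range(n_clusters):
        # Cluster centres spaced `separation` apart on every informative axis.
        centre = np.full(n_informative, c * separation)
        blocks.append(rng.normal(centre, 1.0, (n_per_cluster, n_informative)))
    X = np.vstack(blocks)
    if n_noise_vars:
        # Uninformative variables carry no cluster signal at all.
        X = np.hstack([X, rng.normal(size=(len(X), n_noise_vars))])
    n_out = int(outlier_frac * len(X))
    if n_out:
        # Outliers: replace some rows with draws from a wide uniform range.
        rows = rng.choice(len(X), n_out, replace=False)
        X[rows] = rng.uniform(X.min(), X.max(), (n_out, X.shape[1]))
    labels = np.repeat(np.arange(n_clusters), n_per_cluster)
    return X, labels

X, y = simulate(n_clusters=5, n_noise_vars=10, outlier_frac=0.05)
print(X.shape)  # (500, 15): 5 informative + 10 noise variables
```

Each of the 15 simulated datasets in the study corresponds to one point in such a parameter grid (number of clusters crossed with noise level).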
Finally, it is important to remember that there are still many sources of variation left in our analyses, as can be seen in Fig. 1. For example, we limited our analysis to a single MDD dataset with a limited set of markers, because the primary focus was on the influence of model specifications on the results and not on the effects of different data-processing choices. Furthermore, we chose to exclude clusters smaller than 1% of the data, under the assumption that these are likely to represent methodological artifacts or outliers rather than true cluster structure in the data. Arguably, other approaches to such ‘nuisance clusters’ could have been equally valid. The same goes for the way we chose to estimate the model results under the null hypothesis for the real datasets.