The 9/11 attacks on the World Trade Center (WTC) left thousands of casualties and drastically affected the lives of hundreds of thousands of New Yorkers and others nearby (Bergen, Reference Bergen2019). Many affected were those dedicating their lives to the safety of others – police, firefighters, emergency medical personnel, and other responders to the crisis. There has been a significant physical and mental burden of the events that day which has left many struggling with their health as they age (Durkin, Reference Durkin2018; Luft et al., Reference Luft, Schechter, Kotov, Broihier, Reissman, Guerrera and Bromet2012). Many responders suffer from PTSD which has been either worsening, staying the same, or gradually improving over time (Cukor et al., Reference Cukor, Wyka, Mello, Olden, Jayasinghe, Roberts and Difede2011; Neria et al., Reference Neria, Olfson, Gameroff, DiGrande, Wickramaratne and Gross2010).
Massive disasters, such as the WTC attacks, can affect a large number of people at the same time and usually occur within a relatively short period. Illuminating the risk and protective factors that reliably predict future reductions or increases in PTSD symptoms can lead to improved understanding, more accessible in-clinic guidance on patient's well-being, and more immediate care for those involved in catastrophic events. Previous work has made major headway in establishing longitudinal associations of exposure severity, demographic characteristics, and job duties with health trajectories of WTC responders (Bromet et al., Reference Bromet, Hobbs, Clouston, Gonzalez, Kotov and Luft2016; Cone et al., Reference Cone, Li, Kornblith, Gocheva, Stellman, Shaikh and Bowler2015; Pietrzak et al., Reference Pietrzak, Feder, Singh, Schechter, Bromet and Katz2014). However, additional approaches to risk assessment are needed to more rapidly and thoroughly differentiate those at greatest risk in situations where structured approaches to data collection are not possible.
Recently, Artificial Intelligence (AI)-based techniques have begun to show promise for quickly and accurately assessing mental health from human behavioral data, such as language use patterns. For example, from social media language, researchers have predicted those more prone to post-partum depression (De Choudhury, Kiciman, Dredze, Coppersmith, & Kumar, Reference De Choudhury, Kiciman, Dredze, Coppersmith and Kumar2016), those more likely to receive a clinical diagnosis of depression (Eichstaedt et al., Reference Eichstaedt, Smith, Merchant, Ungar, Crutchley, Preoţiuc-Pietro and Schwartz2018) or those appearing at greatest risk of suicide (Matero et al., Reference Matero, Idnani, Son, Giorgi, Vu, Zamani and Schwartz2019; Zirikly, Resnik, Uzuner, & Hollingshead, Reference Zirikly, Resnik, Uzuner and Hollingshead2019). For PTSD in particular, although studies have yet to validate models in a clinical setting, past work has shown that AI-based language techniques can distinguish Twitter users that have publicly disclosed a diagnosis of the condition from random selections of users (e.g. Coppersmith, Dredze, & Harman, Reference Coppersmith, Dredze and Harman2014; Preotiuc-Pietro, Sap, Schwartz, & Ungar, Reference Preotiuc-Pietro, Sap, Schwartz and Ungar2015; Reece et al. Reference Reece, Reagan, Lix, Dodds, Danforth and Langer2017).
AI-based language analyses are strong candidates to improve risk assessments in a clinical setting because they enable a much wider range of responses (like an interview) whereby a score can be objectively determined (e.g. like standardized assessment). Once an AI-based technique is created (i.e. it is ‘pre-trained’), it will always yield the same and robust score for a given input. While these language-based assessments were studied with social media texts for PTSD based on self-disclosures (Coppersmith et al., Reference Coppersmith, Dredze and Harman2014; Preotiuc-Pietro et al., Reference Preotiuc-Pietro, Sap, Schwartz and Ungar2015), few works have investigated how effective these approaches are with the language outside social media for predicting PTSD severity evaluated in the clinical settings, especially in a longitudinal study context for PTSD future trajectories. In all such cases, modern machine learning techniques are used to automatically extract and quantify patterns of language from hundreds to thousands of words per individual, which are then used to automatically produce a mental health or risk score. As compared to traditional questionnaire-based assessments, such approaches seem to suffer from fewer self-report biases (Youyou, Kosinski, & Stillwell, Reference Youyou, Kosinski and Stillwell2015) and generally leverage a larger amount of information per person (Kern et al., Reference Kern, Park, Eichstaedt, Schwartz, Sap, Smith and Ungar2016). However, using such approaches in a clinical setting requires patients to share private information from social media pages, and requires that each participant has a substantial amount of data to share in the first place.
In this study, we present the first evaluation of AI-based mental health assessments from language (henceforth language-based assessments) to predict future PTSD symptom trajectories of patients monitored in a clinical setting. Rather than social media, we utilize transcripts of oral history interviews from responders to the 9/11 attacks. We first examine whether existing (‘pre-trained’) predictive models (most of which were trained on social media) produce assessments associated with PTSD symptoms scores close to the time of interview. We then compare these language assessments to other information available within a mental health clinical cohort (e.g. age, gender, occupation) to evaluate the additional benefit of the AI-based assessments. Lastly, we seek to quantify the predictive power of language-based indicators, in part to assess their potential suitability for informing personalized therapeutic approaches.
The sample was derived from Stony Brook University's WTC Health & Wellness program, funded by the Centers for Disease Control and Prevention, that provides ongoing monitoring of WTC responders. A total of N = 124 responders underwent an oral history interview and agreed to allow researchers to merge data from the transcript of the oral history with information in their health monitoring records. Hammock et al. (Reference Hammock, Dreyer, Riaz, Clouston, McGlone and Luft2019) provide an extensive summary of data collection methods. Briefly, oral history participants were primarily recruited via word of mouth and by flyers posted in the Stony Brook WTC Wellness Program.
Each interview lasted approximately 1 h. It covered the responders' memory of 9/11 attacks and disaster relief efforts, their work activities at the site, experiences and sensations over the days and weeks that followed, and how the WTC disaster ultimately impacted their lives since. Interviews were conducted by clinical staff with diverse healthcare backgrounds after a comprehensive orientation in conducting guided interviews and eliciting details relevant to the key topics to be covered. Responders were encouraged to discuss what was most important to them. Interviews were completed between 2010 and 2018.
In order to restrict our sample responders who were not new to the WTC Health Program, the analysis sample was restricted to participants who had at least one valid score on the PTSD Checklist (PCL; Blanchard, Jones-Alexander, Buckley, & Forneris, Reference Blanchard, Jones-Alexander, Buckley and Forneris1996) within 2 years of their interview, and at least one pre-interview PCL yielding an analysis sample of N = 113 responders. The few newer health program enrollees who were excluded from this study were qualitatively different, having only had just begun care (and potential PTSD treatment) at interview time. Furthermore, to study longitudinal trajectories post-interview, we focused on the subset of individuals with at least three post-interview mental health assessments at least 2 years after the interview (N = 75). The demographic characteristics of the study samples are listed in Table 1. The demographic ratio of gender and police remained similar (<4% difference) after we limited the sample to responders who met the criteria for our language analysis; 92% of the subset group were male and 49% were police; their mean age at interview was 53.
This study was approved by the Stony Brook University Institutional Review Board. The participants provided written informed consent.
We automatically derived nine variables assessing the responders’ language during the interviews: four AI-based assessments of psychological traits (expression of anxiety, depression, neuroticism, and extraversion), three lexicon-based assessments of language style (first-person singular pronouns, plural pronouns, and use of articles), and two meta variables describing counts of words and lengths of words. The process to get these variables consisted of three steps: text transcription, conversion of text to linguistic features, and application of AI-based models or lexica.
Audio of each interview was transcribed into text using TranscribeMe, a HIPAA-approved transcription service. Each time the responders spoke, transcribers labelled the time and the words mentioned. The text of each interview was converted into ‘features’ – quantitative values describing the content of the interview language – and then input into: (a) four AI-based assessments of psychological traits, (b) three lexicon-based assessments of language style, and (c) two meta-variable extractions describing counts of words and lengths of words. All analyses, described below, were performed using the Differential Language Analysis ToolKit (DLATK) (Schwartz et al., Reference Schwartz, Giorgi, Sap, Crutchley, Ungar and Eichstaedt2017).
Conversion into linguistic features
The models we used required up to three types of linguistic features: (1) relative frequencies of words and phrases, (2) binary indicators of words and phrases, and (3) topic prevalence scores. Words and phrases are sequences of 1–3 words in a row. Their relative frequency was recorded by DLATK by counting each word or phrase mentioned and dividing by the total number of words or phrases mentioned by the responder. The binary indicator for words and phrases simply indicated whether each word or phrase shows up (1) or not (0). The tokenizer built into the DLATK package was used to extract words per interview.
Topics are weighted groups of semantically-related words, often derived through a statistical process called latent Dirichlet allocation (Blei, Ng, & Jordan, Reference Blei, Ng and Jordan2003). Once derived, topics can be applied to textual data to scoring, ranging from 0 to 1, indicating how frequently each group of words was mentioned (Kern et al., Reference Kern, Park, Eichstaedt, Schwartz, Sap, Smith and Ungar2016). We use a standard set of 2000 topics introduced by Schwartz et al. (Reference Schwartz, Eichstaedt, Kern, Dziurzynski, Lucas, Agrawal and Ungar2013a, Reference Schwartz, Eichstaedt, Kern, Dziurzynski, Ramones and Agrawal2013b), which has frequently been applied in the psychological domain including most recently in Eichstaedt et al. (Reference Eichstaedt, Kern, Yaden, Schwartz, Giorgi, Park and Ungar2020). Once extracted, features were mapped to nine coarse-grained scores as described below and used for analyses herein.
AI-based psychological traits (4)
The AI-based assessments input linguistic features such as words, phrases, and topics, and map them to psychological constructs (Kern et al., Reference Kern, Park, Eichstaedt, Schwartz, Sap, Smith and Ungar2016; Schwartz & Ungar, Reference Schwartz and Ungar2015). We focused on existing pre-trained models for constructs known to be related to our mental health outcomes: (1) neuroticism and (2) extraversion (Park et al., Reference Park, Schwartz, Eichstaedt, Kern, Kosinski, Stillwell and Seligman2015; Schwartz et al., Reference Schwartz, Eichstaedt, Kern, Dziurzynski, Ramones and Agrawal2013b) – the two factors of the five-factor model known to relate negatively to depression and anxiety-related mental health conditions (Farmer et al., Reference Farmer, Redman, Harris, Mahmood, Sadler, Pickering and McGuffin2002; Jorm et al., Reference Jorm, Christensen, Henderson, Jacomb, Korten and Rodgers2000; Jylhä & Isometsä, Reference Jylhä and Isometsä2006), as well as (3) degree of depression and (4) anxiousness (Schwartz et al., Reference Schwartz, Eichstaedt, Kern, Park, Sap, Stillwell and Ungar2014) – subfacets of emotional stability which correspond to negative high arousal language (anxiousness) and negative low arousal language (depressive). These models were trained on large and diverse populations (approximately sample sizes of N = 65 000 for neuroticism and extraversion and N = 29 000 for degrees of depression and anxiousness). They utilize the linguistic features of previously mentioned words and phrases as well as topics as input and output continuous scores for each of the four constructs. They have been validated against standard questionnaire-based measures as well as convergent factors and external criteria under a range of situations (Kern et al., Reference Kern, Park, Eichstaedt, Schwartz, Sap, Smith and Ungar2016; Matero et al., Reference Matero, Idnani, Son, Giorgi, Vu, Zamani and Schwartz2019; Park et al., Reference Park, Schwartz, Eichstaedt, Kern, Kosinski, Stillwell and Seligman2015; Schwartz et al., Reference Schwartz, Eichstaedt, Kern, Park, Sap, Stillwell and Ungar2014). However, the predictive validity of these models has yet to be assessed in clinical interview settings. Importantly, to guard against overfitting, no adjustments were made to the models, and thus this can be considered an evaluation of the models exactly as they were presented in their respective papers (Park et al., Reference Park, Schwartz, Eichstaedt, Kern, Kosinski, Stillwell and Seligman2015; Schwartz et al., Reference Schwartz, Eichstaedt, Kern, Park, Sap, Stillwell and Ungar2014).
Function word lexicon features (3)
We extracted word frequencies of terms in LIWC 2015 categories (Pennebaker, Boyd, Jordan, & Blackburn, Reference Pennebaker, Boyd, Jordan and Blackburn2015) and calculated categories for an interview with each responder. Due to the relatively low sample size, we focused on the function word categories that were most prevalent and then selected those that had a literature-suggested association with mental health:
• First-person singular: depressed, low status, personal, emotional, informal. Previously correlated positively with neuroticism, depression, and anxiety (Baddeley & Singer, Reference Baddeley and Singer2008; Holtzman, Reference Holtzman2017; Rude, Gortner, & Pennebaker, Reference Rude, Gortner and Pennebaker2004) and negatively with life satisfaction (Schwartz et al., Reference Schwartz, Eichstaedt, Kern, Dziurzynski, Lucas, Agrawal and Ungar2013a).
• First-person plural: high status, socially connected to group. Previously correlated negatively with depression and anxiety (Ramirez-Esparza, Chung, Kacewicz, & Pennebaker, Reference Ramirez-Esparza, Chung, Kacewicz and Pennebaker2008) and positively correlated with life satisfaction (Schwartz et al., Reference Schwartz, Eichstaedt, Kern, Dziurzynski, Lucas, Agrawal and Ungar2013a) along with the cognition and psychological well-being variables of our interest (Tausczik & Pennebaker, Reference Tausczik and Pennebaker2010).
• Articles: use of concrete nouns, interest in objects and things (Tausczik & Pennebaker, Reference Tausczik and Pennebaker2010).
Language meta features (2)
• Average word length is known to be associated with higher cognitive (Khawaja, Chen, & Marcus, Reference Khawaja, Chen and Marcus2010), conceptual complexity (Lewis & Frank, Reference Lewis and Frank2016), education, and social class (Hartley, Pennebaker, & Fox, Reference Hartley, Pennebaker and Fox2003; Tausczik & Pennebaker, Reference Tausczik and Pennebaker2010). PTSD is known to impair cognitive processing and impose a cognitive burden (e.g. through intrusive memories and thought suppression) (Nixon, Nehmy, & Seymour, Reference Nixon, Nehmy and Seymour2007).
• Word counts: We also recorded total word counts, the number by which all lexica above were normalized. Given the interviews were all an hour long, this is a proxy for the rate of speech from each participant.
Mental health outcomes
The PTSD Symptom Checklist for DSM-IV PTSD (PCL) was used to assess PTSD severity in the past month (Bromet et al., Reference Bromet, Hobbs, Clouston, Gonzalez, Kotov and Luft2016; Cone et al., Reference Cone, Li, Kornblith, Gocheva, Stellman, Shaikh and Bowler2015; Pietrzak et al., Reference Pietrzak, Feder, Singh, Schechter, Bromet and Katz2014). We chose the PCL closest to the interview date (all within 2 years) for concurrent analyses (average initial PCL score = 33.7; s.d. = 16.2). Following previous work which suggests that a fixed cutoff might not be optimally established for all cases (Andrykowski, Cordova, Studts, & Miller, Reference Andrykowski, Cordova, Studts and Miller1998; Bovin et al., Reference Bovin, Marx, Weathers, Gallagher, Rodriguez, Schnurr and Keane2016), we focused on continuous values. Post-interview PCL scores were used to create trajectories as described under trajectory prediction below.
We used linear regression coefficient of the target explanatory variable (PCL score) as its correlation strength and multivariable adjustment for possible confounders (age, gender, occupation, and years after 9/11) to acquire the unique effects of language-based assessments. On average, the interviews were conducted 10.31 years (s.d. = 1.43) after the event. We controlled for days since 9/11 in the analyses. Since we explored many language assessments at once, we considered coefficients significant if their Benjamini–Hochburg adjusted p values were <0.05.
We processed the interviews of responders who had PTSD assessments three or more interviews after the closest dates to interviews and at least one assessment before the closest dates to interviews for the stable future trajectory modeling. For our cross-sectional correlation analysis linking language-based assessments with PTSD, we selected PCL scores of WTC responders as their cross-sectional PTSD symptom severity at the time of the interview (Interview PCL), and it is controlled for future PTSD trajectories as a baseline.
For modeling the trajectory of PCL scores of each responder, we fit an ordinary least squares regression model with an intercept to the post-interview PCL scores as a function of time t:
where PCL scores were measured at (t) years after the interviews, then use the β 1i coefficient as a future PCL score trajectory of a responder (i). Then, for the person-level prediction over β 1i using the language-based assessments controlling the age, gender, occupation, and years between the interview and 9/11 of the responder i as following:
where x 1: language-based assessments, x 2: baseline PCL, x 3…6: age, gender, occupation, years after 9/11 (all variables standardized). Using equation (1) and (2), we use the following joint model:
and evaluate an effect size of each language-based assessment as its predictive power for future PCL trajectories of the responders (Fig. 1).
For the longitudinal trajectories post-interview, we focused on the subset of individuals with at least three post-interview PCL assessments occurring at least 2 years following the interview (N = 75). Sample demographics are reported in Table 1. Counting the interview, these criteria allowed the trajectories to be derived from at least four data points per participant, with the last assessment occurring on average (mean) 5.5 years (s.d. = 1.3) after the interview. By using this trajectory-based approach, all assessments available were used and aligned with their dates of administration.
Most responders were male (90%) and half (48%) were police (see Table 1 for sample characteristics). Their median age at their interviews was 55 (53 for the longitudinal cohort). The median number of words across the interviews was 10 254.
Associations between language-based assessments and PTSD severity
Table 2 shows the linear regression analyses linking language-based assessments and PCL scores among responders around their interview dates. Higher PCL scores were significantly associated with language-based assessments consistent with anxious, depressive, and neuroticism. High scores were also associated with greater use of first-person singular and more total count of words in their interviews (r > 0.22). Conversely, higher scores were also associated with less extraversion language patterns, first-person plurals, and articles. Results remained unchanged after adjusting for age, gender, occupation, and years after 9/11 despite some effects from covariates (<0.07).
Associations are from ordinary least squares over standardized independent variable – the language-based assessment and the standardized dependent variable – PTSD Checklist scores (PCL scores). Without controls is equivalent to Pearson Product-Moment Correlation (N = 75). Square brackets indicate 95% confidence intervals. Controls included as covariates (right column) included age, gender, occupation, days between 9/11/01 and interview date.
* Indicates significant correlations (multi-test, Benjamini–Hochburg adjusted p < 0.050). Each row is color-coded separately, from red (negative correlations) to green (positive correlations); greyed values indicate non-significant.
Table 3 shows that language-based assessments of the oral histories significantly predicted responders' PCL trajectories during the follow-up period. First, we calculated linear regression coefficient effect sizes when we modeled PCL score trajectories with language features only (first column of Table 3). Then we add the control variables into the model (the second column). Although the general directions of correlations were the same both with and without controls, suppression effects of control variables increased the effect sizes for anxiety and first-person plural usage (Fig. 2).
Associations are from ordinary least squares over standardized independent variable – the language-based assessment and the standardized dependent variable – PCL future trajectory. Without controls is equivalent to Pearson Product-Moment Correlation (N = 75) with controls: age, gender, occupation, days between 9/11/01 and interview date, and interview PCL score.
The goal of the present study was to examine whether AI-based language assessments developed in non-clinical contexts were reliably (1) associated with PTSD on self-reported questionnaires, and (2) able to predict the extent to which one's symptoms would get better or worse (trajectory) within a long-term clinical setting. The study found support for the view that language-based assessments could be reliably used in a clinical setting when processing naturalistic interviews: specifically, we found that language-based features were indicative of current functioning (supporting aim 1) and that language-based features could predict future PTSD symptom trajectories (supporting aim 2). This study, for the first time, suggested that AI assessments of interviews from a clinical sample not focused specifically on the topic of mental health could be used to identify features indicative of a person's current and future mental well-being.
There are three major implications from this work. First, AI-based assessments of interviews were associated with the assessment of mental health scores concurrently, supporting the first aim. Specifically, depressed language was associated with greater PTSD symptom severity, as is self-focused language. This corroborates the clinical conceptualization of PTSD as involving self-focused rumination that maintains PTSD symptoms over time (Michael, Halligan, Clark, & Ehlers, Reference Michael, Halligan, Clark and Ehlers2007).
Second, the use of more anxious language predicted increased PTSD symptoms in the future, even when adjusting for age, gender, occupation, and days since the 9/11 disaster. This suggests that while immediate PTSD severity is associated with low mood, a worsening of PTSD is determined by anxiety, rather than depression. These results may suggest that while immediate PTSD severity is reflected in affective experience, it may be the cognitive processes associated with anxiety (worry, rumination) that underlie future increases in PTSD symptoms. This dovetails with the accounts of PTSD that understand it to be maintained through rumination and worry (Michael et al., Reference Michael, Halligan, Clark and Ehlers2007).
Third, the use of more first-person plural pronouns (‘we’, ‘us’, ‘our’) predicted decreased PTSD symptoms in the future when adjusting for the confound variables. This supports research showing that social support is an important affordance that can buffer against and help alleviate the psychopathological load of a traumatic life event. Previous findings have suggested that processes associated with chronic sympathetic arousal (which include the chronic activation of the HPA-axis in states of hypervigilance) may be ‘buffered against’ by social interactions with kin and close others (e.g. McGowan, Reference McGowan2002).
Depressive language and current PTSD severity
Depressive language (β = 0.32; p = 0.049) and high usage of first-person singulars (β = 0.31; p = 0.049) were most highly correlated with high PCL scores even after accounting for age, gender, days since 9/11, and responder occupation. These associations were consistent with findings from prior studies showing an association of PTSD symptoms with increased risk of depression (Breslau, Davis, Peterson, & Schultz, Reference Breslau, Davis, Peterson and Schultz2000; Stander, Thomsen, & Highfill-McRoy, Reference Stander, Thomsen and Highfill-McRoy2014). Similarly, high usage of first-person singular in messages is negatively correlated with life satisfaction (Schwartz et al., Reference Schwartz, Eichstaedt, Kern, Dziurzynski, Lucas, Agrawal and Ungar2013a). Anxious and neurotic language patterns had strong positive correlations with PCL scores, which align with a previous study that identified avoidance and hyperarousal symptoms as frequently reported symptoms (Bromet et al., Reference Bromet, Hobbs, Clouston, Gonzalez, Kotov and Luft2016). For the associations between personality traits and PTSD symptoms, previous studies found that low extraversion and high neuroticism are associated with an increased risk of PTSD (Breslau, Davis, & Andreski, Reference Breslau, Davis and Andreski1995; Fauerbach, Lawrence, Schmidt, Munster, & Costa, Reference Fauerbach, Lawrence, Schmidt, Munster and Costa2000), and we observed the same patterns of our language-based extraversion and neuroticism with PTSD severity.
Predictors of PTSD symptom trajectories
We examined language-based assessments as a predictor of responders' PCL trajectories after their interviews. Usage of first-person plurals and longer average word lengths were most highly correlated with improvement in PTSD in all cases, whether adjusting for baseline PCL-score and demographics or not. For other language-based assessments, coefficient effect sizes increased when we accounted for confounding due to the suppression effects mainly attributable to PCL scores, age at interview, and gender (see online Supplementary Table S1).
Furthermore, we analyzed potential mediation effects of marital status to address whether differences in the use of pronouns were merely reflecting marital status although previous work does not suggest such a relationship (Simmons, Gordon, & Chambless, Reference Simmons, Gordon and Chambless2015). Our results showed that these two types of language-based assessments predicted beyond marital status as their correlations remained statistically significant after adjusting for both controls and marital status (see online Supplementary Tables S2 and S3). This demonstrates that these linguistic markers capture an orientation toward the self and others over and above marital status.
In line with an extensive literature in psychology, we observed the use of ‘I’ v. ‘we’ pronouns to mark classes of psychological processes that determined adjustment to and recovery from trauma. Previous work has related higher use of first-person singular pronouns (‘I’-talk) self-focus (Carey et al., Reference Carey, Brucks, Küfner, Holtzman, Back, Donnellan and Mehl2015); we found it correlated with high cross-sectional PTSD severity. On the other hand, we found high usage of first-person plural pronouns (‘we’) to be associated with a decrease of PTSD symptoms in the future. Self-focused thinking has been identified as a transdiagnostic factor of PTSD and depressive symptoms marking an often maladaptive preoccupation with the self and negative experience (Birrer & Michael, Reference Birrer and Michael2011; Ingram, Reference Ingram1990; Martin, Reference Martin1985). The use of ‘I’ pronouns, in turn, has previously been found to be a dependable marker of self-focus in natural language (Carey et al., Reference Carey, Brucks, Küfner, Holtzman, Back, Donnellan and Mehl2015; Watkins & Teasdale, Reference Watkins and Teasdale2001; Wegner & Giuliano, Reference Wegner and Giuliano1980). Beyond mere self-focus, depression and negative affectivity have also been robustly associated with higher use of first person singular pronouns (Holtzman, Reference Holtzman2017; Rude et al., Reference Rude, Gortner and Pennebaker2004) and PTSD (Miragoli, Camisasca, & Di Blasio, Reference Miragoli, Camisasca and Di Blasio2019); PTSD also with few ‘we’ pronouns (Papini et al., Reference Papini, Yoon, Rubin, Lopez-Castro and Hien2015). Our study showed further evidence for these patterns: greater use of ‘I’ pronouns positively correlated with severe cross-sectional PTSD symptoms, and high usage of ‘we’ pronouns predicted decreasing PTSD symptoms in the future.
This was the first study to evaluate the relationship between automatic language-based assessments from interviews and PTSD symptoms of a trauma population, and there were several limitations. First, our sample covered a particular cohort of trauma survivors, those responding to the WTC disaster. WTC responders are predominantly male, and members of the monitoring population eligible to participate in this study were predominantly police officers. As such, this study relied on a sample that is similar to the rest of the WTC responder population (Bromet et al., Reference Bromet, Hobbs, Clouston, Gonzalez, Kotov and Luft2016). Nevertheless, this may limit the generalizability of present findings to other occupations and demographic groups. Future research would need to investigate whether the results replicate to additional populations. Second, language-based assessment predicted future change in PTSD and suggested that cognitive and social risk processes may be involved, but mechanisms underpinning these predictive effects were not tested directly. Third, while our feature-based identification process was completed in a large database with ample capacity to train robust AI models, the present analysis had a relatively small sample size that could only be reliably used for application and was too small to retrain models for the current population. Future work in larger samples will be able to tailor AI-based assessments to specific populations and clinical questions substantially enhancing their predictive power.
Potential use in clinical care
Clinical evaluation of PTSD symptoms in trauma-exposed patients is time-consuming and burdensome. Moreover, primary care providers often lack expertise to complete this assessment. Our results show that natural language can provide clinically useful information both for the detection of PTSD and the prediction of future symptom escalation. These methods can be applied to routine clinical interviews completed by staff without mental health expertise. Although oral history interviews used in this project were lengthy, previous research has shown that interactions as brief as 5 min (e.g. 200 words) can be sufficient to obtain reliable AI-based assessments (Kern et al., Reference Kern, Park, Eichstaedt, Schwartz, Sap, Smith and Ungar2016). These assessments would not replace a psychiatric evaluation, but can be useful for screening in primary care and as an aid to psychiatrists, picking-up on diagnostic and prognostic features in language that may be missed clinically. Specific language-based risk factors could inform treatment selection, such as low social support, and may suggest group therapy or peer support interventions, whereas maladaptive cognitive styles suggest cognitive behavioral therapy.
We found automated AI-based assessments utilizing the language of WTC responders in their oral history interviews predicted their PTSD symptoms in both cross-sectional and longitudinal trajectory analyses. The patterns and the correlations from these studies should be examined cautiously, and may require independent confirmations from other WTC cohorts and across different types of exposures before general applications for PTSD treatments. Still, the patterns of language-based assessments consistent with previous findings in other settings and their strong statistical correlations provided unique insights and explanations beyond commonly known confounds or risk factors such as age, gender, occupation, marital status, or even questionnaire-based depression measures, suggesting support for clinicians toward more precise decisions. More generally, language-based assessments that capture individual digital phenotypes and distinctive linguistic markers from transcripts of interviews are very useful for investigating underlying causes of PTSD and may play a critical role as a supplement for enhancing personalized preventive care (Hamburg & Collins, Reference Hamburg and Collins2010) and more effective treatments for PTSD; they may even enable real-time screening or preventive measures with reduced costs and less therapist time for helping a large number of people exposed to large-scale traumatic events (e.g. natural disasters, WTC PTSD) similar to a previous online PTSD treatment (Lewis et al., Reference Lewis, Farewell, Groves, Kitchiner, Roberts, Vick and Bisson2017). Nevertheless, future studies with applying language-based assessment on larger samples will be required in order to more precisely validate their statistical significance and correlations, and even further studies into subphenotypes and more detailed categorizations of language-based assessments will lead to more diverse analysis with rich high-dimensional digital phenotypes.
The supplementary material for this article can be found at https://doi.org/10.1017/S0033291721002294.
The authors are extremely grateful to the WTC rescue and recovery workers, who gave of themselves so readily in response to the WTC attacks and agreed to participate in this ongoing research effort. We also thank the clinical staff of the World Trade Center Medical Monitoring and Treatment Programs for their dedication and the labor and community organizations for their continued support. Son and Schwartz were supported, in part, by NIH R01 AA028032-01 and, in part, by DARPA via grant #W911NF-20-1-0306 to Stony Brook University; the conclusions and opinions expressed are attributable only to the authors and should not be construed as those of DARPA or the U.S. Department of Defense. Clouston was supported, in part, by NIH R01 AG049953.
Conflict of interest