
Automated text analysis in psychology: methods, applications, and future developments*



Recent years have seen rapid developments in automated text analysis methods focused on measuring psychological and demographic properties. While this development has mainly been driven by computer scientists and computational linguists, such methods can be of great value for social scientists in general, and for psychologists in particular. In this paper, we review some of the most popular approaches to automated text analysis from the perspective of social scientists, and give examples of their applications in different theoretical domains. After describing some of the pros and cons of these methods, we speculate about future methodological developments and how they might change the social sciences. We conclude that, although current methods have many disadvantages and pitfalls compared to more traditional methods of data collection, the constant increase in computational power and the wide availability of textual data will inevitably make automated text analysis a common tool for psychologists.



* This research has been supported in part by an AFOSR Young Investigator award to MD, and an ARTIS research grant to RI. We are thankful to Jeremy Ginges, Sid Horton, Antonio Damasio, Jonas Kaplan, Sarah Gimbel, Kate Johnson, Lisa Aziz-Zadeh, Jesse Graham, Peter Khooshabeh, Peter Carnevale, and Derek Harmon for their helpful comments and suggestions.





