
TOWARDS A “TEXT AS DATA” APPROACH IN THE HISTORY OF ECONOMICS: AN APPLICATION TO ADAM SMITH’S CLASSICS

Published online by Cambridge University Press:  12 January 2023

Matthieu Ballandonne, ESSCA School of Management, Angers, France. Email: matthieu.ballandonne@essca.fr.
Igor Cersosimo, ESSCA School of Management, Angers, France. Email: igor.cersosimo@essca.fr.

Abstract

Quantitative techniques have received increasing attention in the history and methodology of economics. Nonetheless, a “text as data” approach has mostly been overlooked and its applicability to the history of economics remains to be examined. To understand what we gain from such quantitative techniques in relation to existing historical analyses, we apply some “text as data” techniques to Adam Smith’s The Theory of Moral Sentiments and The Wealth of Nations. We explore the books’ topics, styles, and sentiments. We show how word frequency analysis can be used to examine the differences between the books, shed light on conceptual discussions, and reveal an important stylistic aspect, specifically Smith’s use of personal pronouns. Style analysis shows the similarities and differences in terms of lexical richness and readability between the two books. We also show the limitations of a third technique, sentiment analysis, when applied to historical economic texts.

Type
Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of the History of Economics Society

I. INTRODUCTION

The fields of the history and methodology of economics have recently experienced a “quantitative turn.” The quantitative turn designates researchers’ increasing use of tools and methods such as bibliometrics, prosopography, network analysis, topic modeling, and text mining (Claveau and Gingras 2016; Wright 2016; Claveau and Herfeld 2018; Geiger and Kufenko 2018; Svorenčík 2018; Baccini 2020; Edwards 2020). Methods for text analysis especially have gained popularity in other fields such as sociology (Evans and Aceves 2016), political science (Grimmer and Stewart 2013), finance (Gupta et al. 2020), and the broader field of the digital humanities.

We adopt in this article a “text as data” perspective, which entails a shift in how researchers consider text. In a recent contribution, Kenneth Benoit explained that “the essence of treating text as data is that it is always transformed into more structured, summary and quantitative data to make it amenable to the familiar tools of data analysis” (Benoit 2020, p. 463).

Economists have notably embraced the text as data approach. An example is the recent article “Text as Data” by Matthew Gentzkow, Bryan Kelly, and Matt Taddy (2019), published in the Journal of Economic Literature. The authors “provide an overview of methods for analyzing text and a survey of current applications in economics and related social sciences” (Gentzkow, Kelly, and Taddy 2019, p. 537). They note, “The rise of text analysis is part of a broader trend toward greater use of machine learning and related statistical methods in economics” (p. 570). In other words, the text as data approach is becoming an influential methodological approach in economics and will also likely become an important new element in the toolbox of historians and methodologists of economics. According to Benoit (2020), the reasons for the growing popularity of the text as data approach are the increase of computational power, the huge volume of texts available in digital format, and the developments of quantitative measures and tools specifically dedicated to text analysis. That historians and methodologists of economics have rarely used text mining can, at first sight, appear curious; after all, their research consists in extracting and creating knowledge from what economists produce, which is, at a very fundamental level, text (see also Komine and Shimodaira 2018).

One reason for this low interest or use arguably lies in the lack of systematic training of historians and methodologists of economics on the quantitative and computational tools needed for the quantitative analysis of texts. This is especially true when considering that some software applications require coding skills and can have, therefore, a significant cost of entry. We believe this will likely change in the next few years and that, as in political science (Grimmer and Stewart 2013), knowing how to automatically access and extract information from (digital) texts will be a valuable competency for researchers in the field.

Another reason that can explain why scholars in the history and methodology of economics have not yet embraced the text as data approach is the lack of research examining its applicability to historical economic documents.

We contribute to filling this gap by asking: What kind of questions could historians of economics investigate with text mining? Are those techniques safely applicable in our field, given the specificities of the texts examined? How do quantitative results compare with those obtained using the traditional approach? To answer, we adopt a case-study approach, applying text as data methods to Adam Smith’s books The Theory of Moral Sentiments (1759, TMS) and The Wealth of Nations (1776, WN). We examine the two books through the lenses of their topics, styles (lexical richness and readability), and sentiments. We chose Smith’s books because they are classics in the field, and because there are countless scholarly contributions about the relationship between them (e.g., Montes 2003, 2004). This enables us to compare the results of the quantitative and qualitative approaches more easily.

Our article can be read from three different perspectives. First, it is a showcase of the potential of the text as data approach in the history of economics. Second, readers can use our discussions of the strengths and limitations of different techniques to orient their own methodological choices. Third, the adoption of a quantitative approach sheds new or complementary light on topics discussed in Smith scholarship.

The studies by James Gherity (1993), Vivienne Brown (1994), Daniel Klein and Michael Clark (2011), Jeffrey Binder and Collin Jennings (2016), Hye-Joon Yoon (2017), and Johan Graafland and Thomas Wells (2021) are of special interest for us. Gherity (1993) claimed that a text published in The Scots Magazine in 1763 was authored by Smith. To support his claim, he complemented the qualitative analysis with the use of quantitative measures of style (such as indicators of sentence lengths and collocations). In her 1994 book Adam Smith’s Discourse: Canonicity, Commerce and Conscience, Brown “attempts a reading of Smith’s works that engages with the complexity of the texts, and with their stylistic, figurative and rhetorical forms” (Brown 1994, p. 3). We will examine Brown’s arguments later in the text. Klein and Clark (2011) study “the place of language denoting musicality or synchrony in the works of Adam Smith” (Klein and Clark 2011, p. 413). They show that while such language is present in TMS, it basically disappears in WN. They contend that “Smith invokes many modes of common experience in developing the idea of coordinated sentiment. Synchrony, however, is certainly central” (Klein and Clark 2011, p. 419). Yoon (2017) has also offered in his book The Rhetoric of Tenses in Adam Smith’s The Wealth of Nations an “investigation of the grammatical features” (Yoon 2017, p. 1) of WN. Both Brown and Yoon adopted, however, a qualitative “text as text” approach. Binder and Jennings (2016) have used topic modeling techniques to study the similarities between the main text of the 1784 edition of WN and its index.
Graafland and Wells (2021) have used a semantic network data-mining approach to examine the role of virtues in WN. They contribute to Smith scholarship by “applying new digital research and visualization techniques.… These techniques allow us to analyze what Smith meant by examining empirically (quantitatively) what he said” (Graafland and Wells 2021, p. 32). They “show with more quantitative precision than traditional scholarship that the invisible hand reading dramatically misrepresents both the nuance and the sum of Smith’s analysis” (Graafland and Wells 2021, p. 31). The authors acknowledge that, although they focused exclusively on WN for their purpose, “one of the possibilities of these new quantitative methods is to allow a comparative analysis between an author’s texts” (Graafland and Wells 2021, p. 34). We adopt such a comparative stance in this article.

Considering that the text as data approach represents a novelty for many scholars in the fields of the history and methodology of economics, we provide, in an online appendix, further details about our methodological choices and some complementary graphical representations.

II. DATA COLLECTION AND PREPROCESSING

We downloaded WN from the Project Gutenberg website (1784 or 1786 edition) and TMS (1790 edition) from the University of Oxford Text Archive. To conduct the text analysis, we used R 4.1.3 (R Core Team 2022). We also used several R packages, some specifically dedicated to text analysis: the quanteda family of packages, version 3.2.1 (Benoit et al. 2018), and sentimentr version 2.9.0 (Rinker 2021).

We first cleaned the two documents by removing the headers and footers added to the original text by the hosting sites (for instance, the information related to the Gutenberg Project). We then converted the two documents into a corpus object, a specific R data type used to conduct textual analysis. To transform the text into data amenable to statistical study, a few further actions need to be considered: case handling, tokenization, stemming, and removal or not of stop words.

Case Handling

The behavior of several R functions we use can be affected by the upper or lower case of the text. As an example, some functions require capital letters to detect features in the text (such as names of people and places), or to split the text into sentences by looking for the period-space-capital-letter pattern. To guarantee the consistency of our results, we converted all our text to lower case before the analysis and used the original text with capital letters only when specific functions required it.

Tokenization

Tokenization is “the process of splitting a text into tokens. This is crucial for computational text analysis because full texts are too specific to perform any meaningful computations with” (Welbers, Van Atteveldt, and Benoit 2017, p. 250). Tokens are separated by spaces and punctuation and, therefore, “most often tokens are words, because these are the most common semantically meaningful components of texts” (Welbers, Van Atteveldt, and Benoit 2017, p. 250).

We decided to split hyphenated words. We did so because Smith makes frequent use of hyphens (in words such as “self-command” but also in constructs such as “praise-worthiness,” “fellow-creatures,” “fellow-feeling,” or in some instances where modern English would use a space, such as “dining-rooms”) and such use can affect some of our results, as we shall see.
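The case-handling and hyphen-splitting choices described above can be illustrated with a minimal sketch. It is written in Python with the standard library only, not in the R and quanteda setup used for the analysis, and the function name `tokenize` and its `split_hyphens` parameter are hypothetical. It also drops digits and punctuation, a simplification assumed here for brevity.

```python
import re


def tokenize(text, split_hyphens=True):
    """Lowercase a text and return its word tokens (illustrative sketch)."""
    text = text.lower()
    if split_hyphens:
        # Treat hyphens as separators, so "self-command" yields two tokens.
        text = re.sub(r"[-\u2010\u2011]", " ", text)
    # Keep runs of letters, allowing internal apostrophes or hyphens.
    return re.findall(r"[a-z]+(?:['’\-][a-z]+)*", text)


sample = "Self-command and Fellow-feeling are virtues."
print(tokenize(sample))                       # hyphenated words are split
print(tokenize(sample, split_hyphens=False))  # hyphenated words are kept whole
```

Comparing the two calls shows why the choice matters: splitting raises the token count and changes which types exist, which in turn affects frequency and lexical richness measures.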

Stemming

Stemming is the process of transforming a derived word into its root (i.e., transforming “thinking” or “thinker” into “think”). This step is often taken to reduce the dimensionality of data (and thus to reduce the computational resources needed for calculation) by aggregating similar words. We decided not to stem the tokens because we consider that all inflections of words must be investigated when studying an author’s style, especially regarding language diversity and readability.

Stop Words

We kept a flexible approach to the inclusion or removal of so-called stop words: “In topic modeling you will typically want to remove or stop-out high frequency words such as the, of, and, a, an, et cetera because these words carry little weight in terms of thematic or topical value” (Jockers and Thalken 2020, p. 220). We decided to keep or remove the stop words on a case-by-case basis, depending on the specific goals of each analysis we carried out. We will see in our topics analysis that keeping stop words can also bring interesting results.

After taking these preliminary steps, we start our analysis by producing some summary statistics of the two books. Table 1 shows the number of tokens, the number of types (unique tokens), the number of sentences, and the number of pages. Note that the numbers in Table 1 are calculated after cleaning the texts, i.e., by removing numbers, punctuation, and symbols, and splitting hyphenated words. The number of pages is included for reference and refers to the printed versions of the books we cite, not to the digital versions we used for our analyses. Overall, these data simply show that WN is a much longer book than TMS, with over twice as many sentences and tokens. The difference in the number of types is comparatively smaller, and we will see how these data are used to measure the lexical richness of the books.

Table 1. Number of Tokens, Types, Sentences, and Pages for Each Book

III. SMITH’S TOPICS

The text as data approach offers the powerful possibility of quickly scanning through long documents, or series of documents, and providing synthetic indicators that facilitate the identification of the topics covered in the text. In this section, we assume no knowledge of the books and provide examples of the methods that can be used for topics detection. We use different and complementary word frequency analysis methods and apply them at different scales of study. We then evaluate the results in the context of existing knowledge about Smith’s books.

Detection of the Most Relevant Topics

We apply three methods to reveal the main topics (and the relationships among them) of the books: weighted term frequency, keyness, and co-occurrences. These methods bring a complementary picture of the topics covered in the two books. Using the three methods together will also show how basic methodological choices (such as excluding or not the stop words) can affect the results.

The most intuitive way to find the most relevant words would be to determine the words with the highest frequencies in the two books (see Shimodaira and Fukuda 2014; Matsuyama 2016; Komine and Shimodaira 2018; and Komine 2019, who all focus on the most frequent nouns in the texts they analyze). However, the most frequent words are usually language components that carry little context-specific meaning (such as the stop words discussed above), and this procedure treats books on their own, not as part of a broader corpus (made of the two books in our case). A more effective approach consists in determining the term frequency-inverse document frequency (TF-IDF) score for each token. Simplifying, the TF-IDF statistic measures the frequency of a token in a book (the TF component), correcting for the number of documents in the corpus that contain the token (the IDF component; see Silge and Robinson 2017).
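As an illustration of the TF-IDF idea, here is a minimal Python sketch. It assumes one common formulation, relative term frequency multiplied by the natural logarithm of the number of documents over the document frequency; other weightings exist and the exact variant used for the analysis is not specified here. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_name: {token: score}} for a dict of token lists.

    Assumed formulation: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for token_counts in counts.values():
        doc_freq.update(token_counts.keys())
    scores = {}
    for name, token_counts in counts.items():
        total = sum(token_counts.values())
        scores[name] = {
            tok: (n / total) * math.log(n_docs / doc_freq[tok])
            for tok, n in token_counts.items()
        }
    return scores


# Toy corpus, not the actual texts.
docs = {
    "tms": "sympathy and virtue and sympathy".split(),
    "wn": "price and silver and price".split(),
}
scores = tf_idf(docs)
print(sorted(scores["tms"].items(), key=lambda kv: -kv[1]))
```

Under this formulation, a token present in every document of a two-book corpus has an IDF of zero, so only tokens exclusive to one book receive a positive score. This is one reason TF-IDF and keyness can return different lists.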

Another, slightly different, method is that of keyness (e.g., Bondi and Scott 2010). Keyness is a measure of how characteristic a word is to one specific document, meaning that the word is especially frequent in that document (indicated as target), when compared with all other documents in the same corpus (indicated as reference). One approach (the one we use) to keyness is to calculate a chi-square statistic on the frequencies of a word in the target and reference document. We present in Table 2 the list of most characteristic words, detected using TF-IDF or keyness (with and without stop words).
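The chi-square approach to keyness can be sketched on a two-by-two table that crosses "this word versus all other words" with "target versus reference". The Python sketch below is a simplified illustration: it applies no continuity correction (an assumption; software packages may apply one by default) and it attaches a sign so that positive values mark words over-represented in the target. The function name `keyness_chi2` is hypothetical.

```python
def keyness_chi2(word, target, reference):
    """Signed chi-square keyness of `word` (no continuity correction).

    target, reference: lists of tokens.
    """
    a = target.count(word)       # word in target
    b = reference.count(word)    # word in reference
    c = len(target) - a          # other tokens in target
    d = len(reference) - b       # other tokens in reference
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # word absent everywhere, or an empty document
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Positive when the word is relatively more frequent in the target.
    return chi2 if a * d >= b * c else -chi2


# Toy data: "we" is 30% of the target tokens but 10% of the reference tokens.
target = ["we"] * 30 + ["x"] * 70
reference = ["we"] * 10 + ["x"] * 90
print(keyness_chi2("we", target, reference))
```

Swapping target and reference flips the sign, which is the practical meaning of the need to choose a target and a reference for keyness.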

Table 2. Most Characteristic Words for Each Book, According to Different Indicators

Assuming no prior knowledge of the two books, Table 2 shows that TMS is mostly characterized by words that belong to the sphere of emotions (such as “joy,” “sorrow,” “emotions,” “love,” and “anger”) and morality (such as “virtue” and “conduct,” or also “judgments” and “approbation”). “Sympathy” is the only word that appears among the most characteristic ones using both TF-IDF and keyness, which would prompt the analyst to investigate Smith’s use of this concept. The words that characterize WN are instead mostly economic notions (such as “price,” “money,” or “tax”). “Silver” is the only word that is detected among the most characteristic ones by both the methods we use.

Overall, from a methodological standpoint, both TF-IDF and keyness are successful in producing a concise picture of what the two books are about. We can also see how the two methods are complementary and how using them together can provide a more reliable and complete list of topics. In fact, the lists of most characteristic words that TF-IDF and keyness produce are different. For instance, words such as “joy,” “sorrow,” and “emotions” are among the most characteristic in TMS only when using TF-IDF. On the contrary, personal pronouns (which we further discuss below), “sentiments,” and “virtue” (and others) are among the most characteristic words only when using keyness. Similarly, “price,” “money,” and “labour” are detected among the most characteristic words in WN only when using keyness. We suggest using both methods in combination, especially when analyzing larger corpora of documents, when the specificities of the two methods (such as the need to choose a target and a reference for keyness, or the different impact of the IDF component for TF-IDF) would have a more pronounced effect.

An additional important finding is that, according to keyness, some stop words are characteristic to TMS (as opposed to WN): “we,” “our,” “us,” “he,” “his,” and “him.” This is uncommon, considering that stop words are normally frequent in most texts. In fact, these unusually frequent stop words are all personal pronouns and highlight an important stylistic difference between the two books. We will look deeper into this result later, as part of our analysis of Smith’s style. As we noted above, however, stop words are typically removed in natural language processing analysis, and thus an important feature of the books would go unnoticed by following that standard methodological practice.

Going beyond single words’ frequencies, further insights are gained by considering how words combine with one another. In fact, a word may have low frequency but be systematically associated with another one (i.e., appearing in the same sentence), for instance because they are part of a specific construct or expression, making the association relevant. We explore this possibility by calculating co-occurrences. We first remove the stop words and keep only the tokens that appear more than fifteen times in TMS and twenty times in WN (the different thresholds account for the different lengths of the books). We then calculate the phi (φ) coefficient for each pair of tokens present in the text. This coefficient is a correlation measure that scores how often two tokens appear in the same sentence versus how often they appear in different sentences. The coefficient varies between one (the two tokens always appear in the same sentence) and minus one (the two tokens never appear in the same sentence). We finally display the fifty most strongly correlated pairs using, as in Hiroyuki Shimodaira and Shinji Fukuda (2014), Naoki Matsuyama (2016), or Julia Silge and David Robinson (2017), a network representation (Figure 1 and Figure 2). We opted for a network representation instead of a standard table format because it is more suitable to displaying possible groups of words and how such words are interconnected.
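The phi coefficient for a pair of tokens can be sketched from four sentence counts: sentences containing both tokens, only the first, only the second, and neither. The Python sketch below is illustrative only; the function name `phi_coefficient` and the toy sentences are hypothetical, and returning zero when a token appears in all or none of the sentences is an assumption made to avoid division by zero.

```python
import math


def phi_coefficient(word_x, word_y, sentences):
    """Phi coefficient of two tokens over a list of tokenized sentences."""
    n11 = n10 = n01 = n00 = 0
    for sentence in sentences:
        present = set(sentence)
        has_x, has_y = word_x in present, word_y in present
        if has_x and has_y:
            n11 += 1
        elif has_x:
            n10 += 1
        elif has_y:
            n01 += 1
        else:
            n00 += 1
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # assumption: undefined cases are scored as zero
    return (n11 * n00 - n10 * n01) / denom


# Toy sentences, not taken from the books.
sentences = [
    ["impartial", "spectator"],
    ["impartial", "spectator", "judges"],
    ["virtue"],
    ["pain", "pleasure"],
]
print(phi_coefficient("impartial", "spectator", sentences))
print(phi_coefficient("impartial", "virtue", sentences))
```

In this toy example the first pair occurs in exactly the same sentences and scores one, while the second pair never shares a sentence and scores below zero.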

Figure 1. A Network Graph of the Fifty Strongest Co-occurrences in TMS

Figure 2. A Network Graph of the Fifty Strongest Co-occurrences in WN

Some co-occurrences highlight relevant aspects related to the contents of the books. An important result for TMS is the detection of “impartial spectator” (top right corner of Figure 1). The two separate words do not occur frequently enough to appear among the most characteristic words we reported in Table 2. However, taken together, the two words form an important co-occurrence and thus a central expression in TMS. This is a well-known result but a crucial one if we assume no knowledge of the book. For TMS, we also find “self” and “command” (that appear as “self‑command” in the book—see bottom right corner of Figure 1). For WN, we find that “silver” (a word present also in Table 2) is often associated with “gold” (bottom center of Figure 2). Also, as seen on the bottom left part of Figure 2, Smith often discusses “Spain” and “Portugal” in the same sentence, or “East” “Indies” and “West” “Indies,” but in separate sentences (there is no line connecting “east” and “west”). We also see that “forts” and “garrisons,” two military terms, tend to appear together (center part of Figure 2), and that the “army” is usually associated with “standing” (forming the expression “standing army,” i.e., professional army) in Smith’s discussion (center right of Figure 2). In addition, the words “answering,” “occasional,” and “demands” (bottom left of Figure 2) are interconnected because they are part of the expression “answering occasional demands” that appears multiple times in WN.

Other co-occurrences are instead indicative of stylistic choices. We can see how Smith, in TMS, tends to include in the same sentence words that have opposite meaning: “right” and “wrong,” “pain” and “pleasure,” “approve” and “disapprove,” “beauty” and “deformity,” “merit” and “demerit” (Figure 1; this last co-occurrence even appears in the titles of several sections of Part II of the book). Arguably, this appears as a precise stylistic or rhetorical choice, possibly made to stress the contrast among the different points of view presented in the discussion.

We can also use co-occurrences to identify recurring language patterns (possibly indicative of an author’s writing habits). An interesting one is the pair “former” and “latter,” which Smith uses very frequently in both TMS (bottom left of Figure 1) and WN (top left of Figure 2). Other examples in TMS are “one” and “another,” “greater” and “part,” and “go” and “along” (Figure 1).

We found co-occurrences to be an effective method for spotting relevant associations between topics and, especially, some of the author’s characteristic expressions. Notably, even the most attentive reader would struggle to keep track of how frequently different words appear together and would probably miss at least some of the expressions we detected, especially those that are not very frequent.

Smith on War

We illustrate in this section an example of a specific conceptual discussion by examining how Smith discusses the notion of war (e.g., Minowitz 1989; Coulomb 1998; Aspromourgos 2007; Hill 2009; Paganelli and Schumacher 2019). The first observation is that the terms “war” and its plural appear more often in WN (61.7 total occurrences every 100,000 words) than in TMS (22.9 total occurrences every 100,000 words), suggesting that Smith is more interested in discussing war within the context of an economic discussion rather than in the context of moral philosophy.
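Normalizing counts to occurrences per 100,000 words is what makes books of different lengths comparable. A minimal Python sketch of that normalization follows; the function name `rate_per_100k` and the toy token list are hypothetical and do not reproduce the figures reported above.

```python
def rate_per_100k(tokens, terms):
    """Occurrences of any of `terms` per 100,000 tokens."""
    hits = sum(1 for tok in tokens if tok in terms)
    return 100_000 * hits / len(tokens)


# Toy data: 5 matching tokens out of 10,000.
tokens = ["war"] * 3 + ["wars"] * 2 + ["peace"] * 9_995
print(rate_per_100k(tokens, {"war", "wars"}))
```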

To get more insights about the semantic context in which Smith writes about war, we apply keyness (as seen before) to each book separately, where the target is now the group of sentences that contain the terms “war” or “wars,” while the reference is the group of sentences that do not contain said terms. This approach allows us to detect if there are some words that are characteristic to those sentences in which Smith mentions war, in contrast to those in which he does not. We discuss our results below and provide their graphical representations separately (see online appendix, Figure 2 and Figure 3).
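The target and reference groups described in this paragraph can be built by partitioning the sentences of one book. The Python sketch below shows only that partitioning step, under the assumption that sentences are already tokenized and lowercased; the function name `split_by_terms` and the toy sentences are hypothetical. A keyness statistic would then be computed on the two resulting token lists.

```python
def split_by_terms(sentences, terms):
    """Split tokenized sentences into (target_tokens, reference_tokens).

    A sentence goes to the target if it contains any of `terms` (a set).
    """
    target, reference = [], []
    for sentence in sentences:
        if terms & set(sentence):
            target.extend(sentence)
        else:
            reference.extend(sentence)
    return target, reference


# Toy sentences, not taken from the books.
sentences = [
    ["war", "increases", "debt"],
    ["the", "price", "of", "labour"],
    ["wars", "and", "peace"],
]
target, reference = split_by_terms(sentences, {"war", "wars"})
print(target)
print(reference)
```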

Figure 3. MATTR Values throughout TMS and WN

Results for TMS show that, when mentioning war, Smith tends to use other related terms such as “enemies” and “faction.” Characteristic words are also “hazards,” “hardships,” and “tumult,” an indication of the struggles that Smith associates with war. It is noteworthy that words associated with war in TMS are not specifically “moral.” One caveat to these conclusions is that there are only a few sentences in TMS that mention war (we showed earlier that they were indeed more frequent in WN). In sentences that do not mention war, the list of characteristic words is not particularly meaningful.

Concerning WN, we see that “funding,” “contracted,” and “debt” are frequently mentioned in association with war. Hence, the economic costs associated with war are especially relevant in Smith’s discussion (see also, for instance, Paganelli and Schumacher 2019). In contrast, terms such as “price,” “labour,” and “market” are more characteristic to the passages where Smith does not examine war. However, chi-square values for these terms are very low, indicating that such terms are similarly frequent in sentences mentioning war or not.

An interesting observation concerns “peace.” The word appears to be characteristically associated with “war” in WN, but not in TMS. This seems to suggest a different use of “peace” in the two books. We can confirm this intuition by looking at the co-occurrences for “peace” in TMS (see online appendix, Table 1, for the co-occurrences of the word “peace” calculated for the whole book), where we find that it is more strongly associated with words such as “citizens,” “society,” or “order.” The quantitative results thus suggest that “peace” in WN is used as an antonym of “war.” This is not the case in TMS, where “peace” is typically (that is, not exclusively) used as a synonym of social harmony.

To conclude this section on topic analysis, the simple methods we presented are useful for researchers approaching a new, potentially large, body of (historical) literature. They enable a quick and, as we saw, effective summary of the topics treated in the documents examined and how they relate to each other. They can also be used to provide deeper insights into how an author framed a specific conceptual discussion.

IV. SMITH’S STYLE

That style matters for the interpretation of historical documents and especially for the study of Smith’s texts has been shown in Brown’s 1994 Adam Smith’s Discourse. Brown argues that “the stylistic forms of Smith’s texts are not homogeneous, and … this is indicative of their different ethical structures” (Brown 1994, p. 3). As Brown also puts it: “an exploration of the literary style of TMS is in some measure constitutive of understanding the moral argument itself” (1994, p. 26). In analyzing Smith’s style, Gherity (1993) distinguished between “soft” and “hard” elements of style. The “soft” elements are about the structure of an argument, the writing flow, the choice of a specific terminology reflecting an author’s writing habit or background knowledge, et cetera (Gherity 1993, pp. 250–257); while the “hard” elements are those that “can be measured and subjected to statistical analysis” (Gherity 1993, p. 257).

Adopting the text as data approach, we focus on the “hard” elements of style of TMS and WN. To do so, we measure the lexical richness and the readability of the two books. Further developing the results shown in the topics detection section, and complementing Brown’s (1994) work, we also compare the role of personal pronouns in the two books. We conduct our analysis on each section of the books. To facilitate the comparison between the two books, which are of different lengths and structures, we use a percentage index to show the position of a section in a book.

Lexical Richness

Many indicators exist for the study of lexical richness (or lexical diversity). We focus on three of the most widely used, calculating them for each section of the books and comparing the results. We initially considered two indicators for measuring lexical richness: the type-token ratio (TTR) and the hapax richness. The TTR divides the number of unique or different words (types) by the total number of words (tokens) (Jockers and Thalken 2020). The approach of the hapax richness is similar to that of TTR, but with the focus shifted to words appearing only once. A word appearing only once is called a hapax legomenon and the hapax richness is the ratio of the number of hapax legomena to the total number of words of the text considered (Jockers and Thalken 2020). The idea for both indicators is the same: a text in which types or hapax legomena make up a larger share of the total number of tokens is lexically richer.
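Both ratios follow directly from the definitions above. The Python sketch below is a minimal illustration; the function names `ttr` and `hapax_richness` and the toy token list are hypothetical.

```python
from collections import Counter


def ttr(tokens):
    """Type-token ratio: number of unique tokens over number of tokens."""
    return len(set(tokens)) / len(tokens)


def hapax_richness(tokens):
    """Share of tokens that are hapax legomena (types occurring once)."""
    counts = Counter(tokens)
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(tokens)


# Toy data: 9 tokens, 6 types, 3 of which occur only once.
tokens = "the price of silver and the price of gold".split()
print(ttr(tokens))
print(hapax_richness(tokens))
```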

As we show in the appendix (online, section II), these two indicators are nevertheless not suitable for the study of texts of different lengths, as their values are related to the length of the text (a known result that we also find in our case). The effect would be that longer texts would typically be scored as lexically poorer. We therefore opted for another indicator that, by construction, is neutral to text length: the moving-average type token ratio (MATTR) (Covington and McFall 2010).

MATTR overcomes TTR’s limitations by calculating TTRs on a fixed length moving window of consecutive words, and finally using the mean of these TTRs as the final score. The researcher should decide on the length of the window (we use a window of 500 words as suggested by Covington and McFall 2010, p. 97).
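The moving-window computation can be sketched as follows in Python. The sketch slides the window one token at a time and updates the type counts incrementally; the function name `mattr` is hypothetical, and falling back to the plain TTR when the text is shorter than the window is an assumption of this sketch rather than part of the published definition.

```python
from collections import Counter


def mattr(tokens, window=500):
    """Moving-average type-token ratio over windows of `window` tokens."""
    n = len(tokens)
    if n == 0:
        raise ValueError("tokens must not be empty")
    if n <= window:
        # Assumption: for short texts, fall back to the plain TTR.
        return len(set(tokens)) / n
    counts = Counter(tokens[:window])
    total_types = len(counts)  # number of types in the first window
    for i in range(window, n):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[tokens[i]] += 1
        total_types += len(counts)  # number of types in the current window
    n_windows = n - window + 1
    return total_types / (window * n_windows)


# Toy data with a tiny window: windows are aa, ab, bb.
print(mattr(["a", "a", "b", "b"], window=2))
```

Because every window has the same length, the score does not mechanically fall as the text gets longer, which is the property that motivates the choice of MATTR over TTR.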

Figure 3 shows MATTR scores for each section of the two books, applying a smoothing algorithm to better visualize trends. According to MATTR, TMS is lexically richer than WN.

How can we explain that WN is, according to MATTR values, generally lexically poorer than TMS? One hypothesis is that Smith adapted his style to the different audiences of the books, with philosophers being used to comparatively lexically richer texts. It could also be that moral philosophy offers more opportunities for lexically richer discussions than economics. Jack Weinstein (n.d., in the Internet Encyclopedia of Philosophy) also argued, “As he grew older, Smith’s writing style became more efficient and less flowery.” If by “flowery” Weinstein means lexically rich, then our results support this conjecture. However, research has shown that lexical richness generally evolves during a writer’s lifetime (Smith and Kelly 2002 show that the language gets “richer” for two of the three authors they study). Our results show that Smith’s language did not become richer with time.

Readability

After a study of Smith’s Lectures on Rhetoric and Belles Lettres (see Lothian 1963), Brown (1994) has noted that Smith’s “writing appears easy and elegant, not obscure, but in spite of its own precept it was at times deeply metaphorical and intricately textured” (Brown 1994, p. 20). Are Smith’s books indeed easy to read? Is there a difference between TMS and WN? To answer these questions, we calculate readability measures for both books.

Two popular indicators to measure readability are the Flesch (1948) reading ease score and the Flesch-Kincaid (Kincaid et al. 1975) grade level score. Despite different formulations, both measures are based on the idea that a text with longer sentences (number of words per sentence) and longer words (number of syllables per word) is comparatively more difficult to read than a text with shorter sentences and words. Readability scores are thus about the structure of sentences and words, not their meaning. We provide details about the differences between the two indicators in the appendix (online, section III), while we focus below on the results of the Flesch-Kincaid grade level score. This indicator expresses readability as the grade level (based on the US school system) required to understand the text. The interpretation is thus straightforward: the higher the Flesch-Kincaid score, the more complicated the corresponding text is. We calculate the Flesch-Kincaid score for each section of Smith’s books and present our results in Figure 4.
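For readers unfamiliar with the two indicators, their standard published formulas can be written down directly. The Python sketch below takes word, sentence, and syllable counts as inputs, since counting syllables reliably is a separate problem; the function names and the example counts are hypothetical and are not taken from the article's data.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch (1948) reading ease: higher values mean easier text."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level: higher values mean harder text."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical section: 3,500 words in 100 sentences with 5,530 syllables,
# i.e. 35 words per sentence and 1.58 syllables per word.
grade = flesch_kincaid_grade(3500, 100, 5530)   # about 16.7
ease = flesch_reading_ease(3500, 100, 5530)     # about 37.6
```

Both formulas use the same two ingredients with opposite signs, which is why a long-sentence, long-word text scores high on grade level and low on reading ease.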

Figure 4. Flesch-Kincaid Grade Level Scores throughout TMS and WN

We can see that both books generally require a level of education between grades fourteen and twenty to be read, where grade fourteen roughly corresponds to the second year of college education. Hence, according to this indicator, Smith’s books can be considered as difficult to read (see footnote 8). This contradicts, albeit from a purely quantitative perspective, Brown’s (1994) characterization of Smith’s style (see above). By a small margin, WN requires a lower grade of education than TMS. Once again, the intended audiences of the two books may have played a role in determining Smith’s stylistic choices, since we can hypothesize that eighteenth-century philosophers had a higher level of education than merchants of the same period.

Breaking down the Flesch-Kincaid score into its components sheds light on our results and offers additional insights into the readability differences between the two books. As we said, the Flesch-Kincaid score considers that a text is more difficult to read when it is characterized by long sentences and long words. The boxplots in Figure 5 show the distributions of the lengths of sentences and words in the books.

Figure 5. Boxplots of the Distribution of the Lengths of Sentences and Words in TMS and WN

Concerning the length of sentences, there is not much difference between TMS and WN, with sentences being mostly between thirty and forty words long. Concerning the length of words, we can see that TMS contains generally longer words: around 50% of sections in TMS have an average word length greater than 1.58 syllables, while only around 25% of sections in WN use similarly long words. It is reasonable to assume that the choice of words is, at least to a certain extent, driven by the topics examined, and that the difference in word length between the two books is less of a stylistic choice than the length of sentences. Since this last indicator is practically identical in the two books, we can conclude that this aspect of Smith’s style did not change over time.

To conclude, the study of Smith’s two books through the lenses of readability and lexical richness indicators shows that TMS is marginally more complicated and lexically richer than WN. Noticeably, however, the length of sentences (which is arguably the more stylistic component of readability) is almost identical in the two books.

Smith’s Voices

We noted earlier, using keyness, that, compared with WN, TMS is characterized by the frequent use of certain stop words, all personal pronouns: “we,” “our,” “he,” and “his” (“us” is also a pronoun characteristic of TMS but is not in the list of stop words). This result is even more striking because these pronouns are more characteristic of TMS than moral words. Smith’s different use of personal pronouns in the two books thus appears to be worthy of investigation. We therefore examine in Figure 6 the relative frequencies of all personal pronouns in Smith’s two books. In line with the keyness results, we see that Smith used “I,” “you,” “he,” “him,” “we,” and “us” very frequently in TMS but not in WN. The overall picture is that the personal pronouns stand out more in TMS than in WN.
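The relative frequencies discussed here are simple to compute: count each pronoun and divide by the total number of tokens. The Python sketch below is only an illustration; the pronoun list, the tokenizer, and the function name are assumptions for the example and may differ from the choices behind Figure 6.

```python
import re
from collections import Counter

# Assumed list of personal pronouns (subject and object forms).
# Note that "her" is ambiguous, since it is also a possessive.
PRONOUNS = ("i", "me", "you", "he", "him", "she", "her",
            "it", "we", "us", "they", "them")


def pronoun_relative_frequencies(text, pronouns=PRONOUNS):
    """Return each pronoun's share of all tokens, in percent."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("cannot compute frequencies on an empty text")
    counts = Counter(tokens)
    return {p: 100 * counts[p] / len(tokens) for p in pronouns}


# Hypothetical nine-word sentence containing "we", "him", "he", and "us".
freqs = pronoun_relative_frequencies("We feel for him as he feels for us")
```

Dividing by the total number of tokens is what makes the figures comparable across two books of very different lengths.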

Figure 6. Relative Frequencies (%) of Personal Pronouns in TMS and WN

Showing again that the quantitative and qualitative approaches are complementary, we can frame our discussion of the above results by using some of the arguments presented in Brown’s (1994) Adam Smith’s Discourse. Brown has argued that the styles in TMS and WN are different and that these different styles are representative of their different “moral plane” (Brown 1994, p. 5). More specifically, Brown used Mikhail Bakhtin’s distinction between a dialogic and a monologic text:

Bakhtin argued that what he called novelistic discourse is characterised by a ‘dialogic’ style incorporating a range of ‘voices’ representing different cultural, political and axiological views, and within this category he included ethical discourses such as Stoic and other self-interrogative discourses. (Brown 1994, p. 4)

Bakhtin counterposed the dialogic form of novelistic discourse to its other: the monologic form of discourse, where a single unitary voice of authority or tradition is at work controlling the text (texts such as scientific discourse or what Bakhtin termed the epic form). (Brown 1994, p. 26)

Brown argued that TMS would be dialogic while WN would be monologic: “The openness and uncertainty of the text of TMS … is rendered stylistically in its multivocal dialogism. By comparison, WN stands as a largely single-voiced monologic text, with its expressed certainties and intellectual order” (Brown 1994, p. 37). Our analysis based on keyness and frequency of personal pronouns tends to confirm this result: the different “voices” (pronouns) are more frequent in (and more characteristic of) TMS than in WN, showing how TMS would be relatively more “multivocal” than WN.

Nevertheless, as Brown warned, a monologic text does not mean that there are no other voices but that the other voices are used to support the arguments of the main voice: “WN is not undifferentiated stylistically, but a crucial contrast with TMS is that these different kinds of authorial interventions do not challenge the formal meaning of the didactic voice” (Brown 1994, p. 38). Our results also confirm that WN is not “undifferentiated stylistically” in terms of voices. Indeed, even though relatively less frequent, all personal pronouns are also found in WN. This study of personal pronouns thus also highlights a limitation of the quantitative method. As we said, the distinction between the dialogic and monologic styles is not only about the presence of voices but about how they interact and give room to each other. Brown’s (1994) above argument can thus be fully appreciated only by careful reading of the books.

V. SENTIMENT ANALYSIS

We now apply sentiment analysis to Smith’s two books. Bing Liu (2020) defines sentiment analysis as “a field of study that aims to extract opinions and sentiments from natural language text using computational methods” (Liu 2020, p. xi). Sentiment analysis is also of particular interest for historians and methodologists of economics because of developments in the discipline with the “emerging field” of “sentometrics, which is a portmanteau of sentiment and econometrics” (Algaba et al. 2020, p. 513; see also the sentometrics R package by Ardia et al. 2021).

Sentiment analysis works by splitting the text into single words and then looking for the sentiment of each word in a lexicon, which is a dictionary in which words are associated with a sentiment score. We used a version of the method (implemented in the R package sentimentr) that also considers valence shifters (see details in the online appendix).
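The general idea of a lexicon lookup with valence shifters can be illustrated with a deliberately simplified sketch. The Python code below is not the sentimentr algorithm, which handles several kinds of valence shifters and is documented in that package; the tiny lexicon, its scores, the list of negators, the two-word lookback, and the scaling by the square root of the sentence length are all assumptions made for the example.

```python
import math
import re

# Hypothetical mini-lexicon: word -> polarity score.
LEXICON = {
    "injustice": -1.0, "violence": -1.0, "fraud": -1.0,
    "esteem": 1.0, "praise": 1.0, "approved": 1.0,
}
# Hypothetical valence shifters: only simple negators are handled here.
NEGATORS = {"not", "never", "no"}


def sentence_sentiment(sentence, lexicon=LEXICON, negators=NEGATORS, lookback=2):
    """Score one sentence from word-level polarity scores.

    Each polarized word contributes its lexicon score; the sign is flipped
    when an odd number of negators occurs in the `lookback` preceding words.
    The sum is scaled by the square root of the number of tokens so that
    long sentences do not dominate purely through their length.
    """
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:
        return 0.0
    total = 0.0
    for i, token in enumerate(tokens):
        score = lexicon.get(token)
        if score is None:
            continue  # word not in the lexicon: treated as neutral
        context = tokens[max(0, i - lookback):i]
        if sum(1 for t in context if t in negators) % 2 == 1:
            score = -score
        total += score
    return total / math.sqrt(len(tokens))


negative_example = sentence_sentiment("an injustice of open violence")
negated_example = sentence_sentiment("never approved")
```

Words missing from the lexicon contribute nothing, which is why specialised vocabulary can leave a sentence scored as neutral whatever its actual tone.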

Note that sentimentr’s default lexicon (called Jockers and Rinker’s polarity table, see Rinker 2021) is based on US English, while Smith uses British English. To improve the reliability of our results, we expanded sentimentr’s default lexicon by including the British spelling of words. We used the variant conversion information built in the Spell Checker Oriented Word Lists (SCOWL; see Atkinson 2019), assigning the British variants the same scores as the corresponding American words.
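The lexicon extension amounts to copying each American entry's score to its British variant. The Python sketch below shows the idea under stated assumptions: the spelling table is a hypothetical three-word excerpt rather than the SCOWL data, the scores are invented, and the function name is illustrative.

```python
# Hypothetical excerpt of an American-to-British spelling table.
US_TO_UK = {"honor": "honour", "labor": "labour", "favor": "favour"}


def add_british_variants(lexicon, us_to_uk):
    """Return a copy of the lexicon with British spellings added.

    Each British variant receives the score of its American counterpart.
    Existing entries are never overwritten, and variants of words that
    are absent from the lexicon are skipped.
    """
    extended = dict(lexicon)
    for us_word, uk_word in us_to_uk.items():
        if us_word in lexicon and uk_word not in extended:
            extended[uk_word] = lexicon[us_word]
    return extended


# Invented scores, for illustration only.
us_lexicon = {"honor": 0.75, "labor": -0.25}
extended_lexicon = add_british_variants(us_lexicon, US_TO_UK)
```

Returning a new dictionary rather than modifying the original keeps the default lexicon available for comparison with the extended one.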

Figure 7 gives two graphical representations of the sentiment scores for all sentences in the two books. The y-axis in the graph on the left indicates the direction (positive or negative) and the strength of the sentiment measured in each sentence. Positive sentiment would reflect feelings such as happiness, approval, delight, satisfaction, thankfulness, et cetera; negative sentiment would reflect sadness, agitation, agony, danger, failure, pain, et cetera. Values close to zero are essentially neutral. The black lines are obtained by smoothing the sentiment values. The cloud of points around the smoothing lines shows the strong variability of the sentiment values, not immediately appreciable by looking only at the smoothed lines (see footnote 9). Some patterns are visible, most notably the inversion of the two books, in terms of which one has the most positive sentiment (WN in the first half, TMS in the second half). Such patterns are, however, not very pronounced, and smoothed values are always within a narrow interval: positive but close to zero.

Figure 7. Left: Sentiment Scores for All Sentences throughout TMS and WN; Right: Boxplots of the Distribution of Sentiment Values in TMS and WN

The boxplots on the right side of Figure 7 complement the picture. We can see how most of the sentiment scores are moderately positive but with substantial variability. Most notably, many sentences have sentiment scores so positive or so negative that they fall outside the whiskers of the boxplot.

The large variability in the sentiment scores is an indication that, from the point of view of sentiment analysis, Smith used a varied and emotionally charged vocabulary. This can be illustrated by considering specific sentences whose sentiment score is either particularly positive or particularly negative. We can take this sentence from TMS:

But though these three passions, the desire of rendering ourselves the proper objects of honour and esteem; or of becoming what is honourable and estimable; the desire of acquiring honour and esteem by really deserving those sentiments; and the frivolous desire of praise at any rate, are widely different; though the two former are always approved of, while the latter never fails to be despised; there is, however, a certain remote affinity among them, which, exaggerated by the humorous and diverting eloquence of this lively author [Mandeville], has enabled him to impose upon his readers. (Smith [1759] 2002, pp. 365–366)

The sentence is filled with positive words (passions, desire, proper, esteem, acquiring, deserving, praise, approved, affinity, humorous, eloquence, lively, enabled). Some negative words are also present, but far less numerous (frivolous, fails, despised, exaggerated, impose).

An example of a negative sentence is when Smith writes in WN that “a simple augmentation is an injustice of open violence; whereas an adulteration is an injustice of treacherous fraud” (Smith [1776] 1977, p. 1260). The sentence contains several extremely negative words (injustice, violence, adulteration, injustice, treacherous, fraud) and not a single positive one.

While similar examples can be found across the two books, it would be difficult to draw any strong conclusion on the books’ general sentiments from the quantitative results. Indeed, explaining the pronounced variability of the sentiment scores would require further research, but we can make some hypotheses.

First, sentiment analysis is not usually applied to long historical texts. The method is usually applied to determine the sentiments of short and contemporary texts (tweets, online reviews) or of documents in which emotions play a relevant role (stories, novels). Smith’s books are, as we have seen, difficult to read, with long and complex sentences, and this could have an impact on the capability of the method to score each sentence.

Second, the method could fail because of the limitations of the lexicon used. As we noted, we extended the default lexicon by including the British spelling (used by Smith), but the two books are about specific subjects (moral philosophy versus commerce) and the jargon used in them is often outside the scope of the lexicon, which has been developed for other purposes. The fact that the lexicons currently in use have not been developed for the analysis of historical economic texts carries another consequence: the sentiment score associated with some words could be a poor fit in the specific context of Smith’s writings. In WN, Smith frequently uses words like “gold,” “wealth,” and “money,” and such words are all associated with a positive sentiment. We argue, however, that similar words would better be associated with a neutral sentiment, when they are used in the context of a text on commerce. The same could be said for words with a negative sentiment score, like “tax,” “payment,” or “labour.”

To conclude, despite our best efforts to adapt an existing and popular (in other fields) methodology to the context of historical economic texts, the results do not seem to be readily interpretable. Consequently, our recommendation would be to not use (or to use cautiously) sentiment analysis, in its present state, for the study of historical economic documents.

VI. CONCLUSION

We have seen that a text as data approach can shed light on some important aspects of a corpus of historical economic documents, albeit with some caveats and limitations.

The most effective methods we applied were those used for topics detection. Assuming no prior knowledge of the books, the quantitative analysis allowed us to correctly identify the key issues discussed in each book and the differences between the two books. The corpus we analyzed was small but, given the quality of our results, we expect the methods we use to be safely applicable to larger corpora of historical economic texts. We also showed how quantitative tools can be used not only to provide summary information on the overall topics of the books but also to shed light on a specific conceptual discussion. We indeed identified the different contexts in which Smith mentioned “war” and “peace” in the two books. A limitation of these methods is that they partially rely on a few methodological choices concerning, for instance, the minimum frequency for a word to be detected as relevant, or the inclusion or removal of stop words. Another similar issue is that the results of topic analysis using keyness depend on the choice of the target and reference documents. While this was not a problem in our case (since we have only two documents), applications to larger corpora would require special attention on this point.

Concerning Smith’s style, our study combines measures of readability and lexical richness, together with insights coming from word frequency analysis. The analysis returned useful insights into the components of Smith’s style. Measures of readability and lexical richness could be of interest for historians since they can be fruitfully used to inquire about the spread of ideas or texts (the hypothesis could be that documents that are easier to read spread more easily). These are, however, synthetic indicators that can be better appreciated in context, for instance by comparing the results with other documents of the same era or the same author. Important stylistic insights came also from using keyness and word frequency analysis, which allowed us to notice the importance of pronouns in TMS and, especially through the use of co‑occurrences, to spot some expressions that are characteristic of Smith’s writing. These stylistic elements are of interest in themselves and could also prove useful in other contexts. An example is their use in authorship attribution studies, i.e., solving a dispute over the authorship of a specific document (see, for instance, Gherity 1993; Martindale and McKenzie 1995) or determining the author when unknown (for a review of possible applications of authorship attribution, and methods used in that field, see Koppel, Schler, and Argamon 2009; and Stamatatos 2009).

In addition, even though it is a popular methodology in data science and machine learning applications, we recommend scholars be cautious in applying sentiment analysis to lengthy historical economic documents. We could not interpret the results of the sentiment analysis because of the extreme variability of the sentiment scores calculated (we suggested some explanations for this variability).

The limitations of the text as data approach that we highlighted are a reminder that the quantitative and qualitative approaches are not mutually exclusive. As Justin Grimmer and Brandon Stewart (2013, p. 268) put it: “the complexity of language implies that automated content analysis methods will never replace careful and close reading of texts … the methods … are best thought as amplifying and augmenting careful reading and thoughtful analysis” (see also Claveau and Herfeld 2018, p. 602). In this sense, despite its promises, the text as data approach will not provide a definitive answer to Das Adam Smith Problem.

As for future research, an immediate next step to this article would be to extend the analysis to Smith’s other contributions (such as his 1763 Lectures on Jurisprudence), other editions of his books (see Paganelli 2011), or texts by other authors for style and topics comparisons. More broadly, many more text as data methods exist or are to be developed in this active field of research; and their possible applications are vast. Such applications could include authorship attribution or the study of economists’ speeches, economics and financial news, institutional reports, or tweets concerning economists or economics.

SUPPLEMENTARY MATERIAL

To view supplementary material for this article, please visit https://doi.org/10.1017/S1053837222000104

Footnotes

We are very grateful to the reviewers and the editor, Pedro Garcia Duarte, for their helpful comments and advice. We are also indebted to Maria Pia Paganelli and Philippe Fontaine for their suggestions on early drafts of the article, and to Robert Lysiak for his careful proofreading. Previous versions of the article were presented at the internal research seminar of the Institut de la Transformation Digitale (2021) at ESSCA School of Management (Angers, France), and at the online workshop on “Digital Methods in History and Economics” (2021) organized by the Universität Hamburg (Hamburg, Germany). We thank the organizers of these events and all participants for the helpful discussions.

1 See also the 2018 special issue of the Journal of Economic Methodology 25 (4); Œconomia’s call for papers (https://journals.openedition.org/oeconomia/10865, accessed June 10, 2021); or the workshop on “Digital Methods in History and Economics” organized by the Universität Hamburg, October 14–15, 2021.

2 Other authors have written working papers on applications of text as data on historical economic documents: Shimodaira and Fukuda (2014, on Ricardo, J. Mill, and H. Martineau), Matsuyama (2016, on Marshall), Komine and Shimodaira (2018, on Keynes and Beveridge), Komine (2019, on Keynes). Two more publications appear to be dealing with this subject, Shimodaira (2019) and Komine (2021), but they are available only in Japanese.

3 The two books were downloaded on April 4, 2020, from https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/3189 (TMS) and https://www.gutenberg.org/ebooks/3300 (WN). The digital version of TMS that we downloaded does not contain an indication of the edition, but we concluded it must be the sixth edition (1790) because it contains Chapter III of Section III—Part I, “Of the corruption of our moral sentiments …,” which was added in that edition. For the differences between TMS editions, see Matson (2020a and 2020b). About WN, the Project Gutenberg file does not provide information about the edition year. Some characteristics of the text suggest that this is the electronic version of the 1784 third edition or of a later one (on the differences between the editions of WN, see Cannan 1976). For instance, Smith writes “now (1784)” and the text contains the chapter “Conclusion of the Mercantile System,” which was added in the third edition of the book. Nevertheless, our version does not contain the index, which was also added in the third edition. Similarly, Smith’s preface is not included and footnotes are within the text between curly brackets.

4 See also Trincado and Vindel (2015), who calculated the frequency of a small set of words within the context of an econophysics analysis (on Smith’s The Wealth of Nations, Jevons, and Marx).

5 The spatial distribution of the nodes, and the thickness and length of the edges, have no meaning in our graphical representation.

6 A section corresponds to a unit of text: “part,” “book,” “section,” or “chapter” (the structure of both books is a combination of such elements). Titles of the books are not included. Also, we do not split for lower-level units (such as a “digression”).

7 We thank Philippe Fontaine for raising this point.

8 As a contemporary comparison point, McKinley (2010, p. 384) has shown that books of the Harry Potter series by J. K. Rowling have Flesch-Kincaid grade scores ranging from 4.9 to 6.5 when dialogues among the characters in the novel are included in the analysis, and from about 5.5 to 9.4 when dialogues are removed. More broadly, LaRocque (2003) has emphasized that “most Americans—even the highly educated—prefer to read at a grade level of 10 or below” (LaRocque 2003, p. 15).

9 For similar reasons we discarded the default graphical representation given by the sentimentr R package (see the online appendix for details).

REFERENCES

Algaba, Andres, Ardia, David, Bluteau, Keven, Borms, Samuel, and Boudt, Kris. 2020. “Econometrics Meets Sentiment: An Overview of Methodology and Applications.” Journal of Economic Surveys 34 (3): 512–547.
Ardia, David, Bluteau, Keven, Borms, Samuel, and Boudt, Kris. 2021. “The R Package Sentometrics to Compute, Aggregate, and Predict with Textual Sentiment.” Journal of Statistical Software 99: 1–40.
Aspromourgos, Tony. 2007. “Adam Smith, Peace and Economics.” AQ: Australian Quarterly 79 (5): 12–40.
Atkinson, Kevin. 2019. “Spell Checking Oriented Word Lists (SCOWL).” SCOWL (And Friends). http://wordlist.aspell.net/. Accessed March 15, 2020.
Baccini, Alberto. 2020. “A Bibliometric Portrait of Contemporary History of Economic Thought.” In Marcuzzo, Maria C., Deleplace, Ghislain, and Paesani, Paolo, eds., New Perspectives on Political Economy and Its History. Cham: Springer, pp. 39–62.
Benoit, Kenneth. 2020. “Text as Data: An Overview.” In Curini, Luigi and Franzese, Robert, eds., The SAGE Handbook of Research Methods in Political Science and International Relations. London: SAGE Publications Ltd, pp. 461–497.
Benoit, Kenneth, Watanabe, Kohei, Wang, Haiyan, Nulty, Paul, Obeng, Adam, Müller, Stefan, and Matsuo, Akitaka. 2018. “Quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3 (30): 774.
Binder, Jeffrey M., and Jennings, Collin. 2016. “‘A Scientifical View of the Whole’: Adam Smith, Indexing, and Technologies of Abstraction.” ELH 83 (1): 157–180.
Bondi, Marina, and Scott, Mike. 2010. Keyness in Texts. Amsterdam: John Benjamins Publishing.
Brown, Vivienne. 1994. Adam Smith’s Discourse: Canonicity, Commerce and Conscience. London: Routledge.
Cannan, Edwin. 1976. “Editor’s Introduction.” In An Inquiry into the Nature and Causes of the Wealth of Nations by Adam Smith. Edited by Cannan, E. Chicago: University of Chicago Press, pp. xixliv.
Claveau, François, and Gingras, Yves. 2016. “Macrodynamics of Economics: A Bibliometric History.” History of Political Economy 48 (4): 551–592.
Claveau, François, and Herfeld, Catherine. 2018. “Network Analysis in the History of Economics.” History of Political Economy 50 (3): 597–603.
Coulomb, Fanny. 1998. “Adam Smith: A Defence Economist.” Defence and Peace Economics 9 (3): 299–316.
Covington, Michael A., and McFall, Joe D. 2010. “Cutting the Gordian Knot: The Moving-Average Type–Token Ratio (MATTR).” Journal of Quantitative Linguistics 17 (2): 94–100.
Edwards, José. 2020. “Fifty Years of HOPE.” History of Political Economy 52 (1): 1–46.
Evans, James A., and Aceves, Pedro. 2016. “Machine Translation: Mining Text for Social Theory.” Annual Review of Sociology 42 (1): 21–50.
Flesch, Rudolph. 1948. “A New Readability Yardstick.” Journal of Applied Psychology 32 (3): 221–233.
Geiger, Niels, and Kufenko, Vadim. 2018. Business Cycles and Economic Crises: A Bibliometric and Economic History. London: Routledge.
Gentzkow, Matthew, Kelly, Bryan, and Taddy, Matt. 2019. “Text as Data.” Journal of Economic Literature 57 (3): 535–574.
Gherity, James. 1993. “An Early Publication by Adam Smith.” History of Political Economy 25 (2): 241–282.
Graafland, Johan, and Wells, Thomas R. 2021. “In Adam Smith’s Own Words: The Role of Virtues in the Relationship Between Free Market Economies and Societal Flourishing, A Semantic Network Data-Mining Approach.” Journal of Business Ethics 172 (1): 31–42.
Grimmer, Justin, and Stewart, Brandon M. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3): 267–297.
Gupta, Aaryan, Dengre, Vinya, Kheruwala, Hamza Abubakar, and Shah, Manan. 2020. “Comprehensive Review of Text-Mining Applications in Finance.” Financial Innovation 6 (1): 1–25.
Hill, Lisa. 2009. “Adam Smith on War (and Peace).” In Hall, Ian and Hill, Lisa, eds., British International Thinkers from Hobbes to Namier. New York: Palgrave Macmillan, pp. 71–89.
Jockers, Matthew L., and Thalken, Rosamond. 2020. Text Analysis with R: For Students of Literature. Cham: Springer Nature.
Kincaid, J. Peter, Fishburne, Robert P. Jr., Rogers, Richard L., and Chissom, Brad S. 1975. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Defense Technical Information Center. https://apps.dtic.mil/sti/pdfs/ADA006655.pdf. Accessed October 26, 2022.
Klein, Daniel B., and Clark, Michael J. 2011. “The Music of Social Intercourse: Synchrony in Adam Smith.” Independent Review 15 (3): 413–420.
Komine, Atsushi. 2019. “Text Mining and the History of Economic Thought: Keynes’s Treatises on Probability, Money, and Employment.” Presented at the Annual Conference of the European Society for the History of Economic Thought (ESHET), Lille, France.
Komine, Atsushi, ed. 2021. テキストマイニングから読み解く経済学史 [The history of economics deciphered from text mining]. Kyoto: ナカニシヤ出版. [In Japanese]
Komine, Atsushi, and Shimodaira, Hiroyuki. 2018. “Keynesian Elements in Beveridge’s Free Society (1944): A Text Mining Approach to the History of Economic Thought.” Working paper. https://www.econ.ryukoku.ac.jp/about/activity/images/18-01.pdf. Accessed March 9, 2022.
Koppel, Moshe, Schler, Jonathan, and Argamon, Shlomo. 2009. “Computational Methods in Authorship Attribution.” Journal of the American Society for Information Science and Technology 60 (1): 9–26.
LaRocque, Paula. 2003. The Book on Writing: The Ultimate Guide to Writing Well. Portland: Marion Street Press.
Liu, Bing. 2020. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Second edition. Cambridge: Cambridge University Press.
Lothian, John M. 1963. Lectures on Rhetoric and Belles Lettres: Delivered in the University of Glasgow by Adam Smith; Reported by a Student in 1762–63. Carbondale: Southern Illinois University Press.
Martindale, Colin, and McKenzie, Dean. 1995. “On the Utility of Content Analysis in Author Attribution: The Federalist.” Computers and the Humanities 29 (4): 259–270.
Matson, Erik W. 2020a. “A Brief History of the Editions of TMS: Part I.” Adam Smith Works. https://www.adamsmithworks.org/documents/a-brief-history-of-the-editions-of-tms. Accessed October 4, 2021.
Matson, Erik W. 2020b. “A Brief History of the Editions of TMS: Part 2.” Adam Smith Works. https://www.adamsmithworks.org/documents/erik-matson-brief-history-of-the-editions-of-tms-part-2. Accessed October 4, 2021.
Matsuyama, Naoki. 2016. “A Study of Text mining for Research into the History of Economic Thought: The Case of Alfred Marshall’s ‘Principles of Economics’ (1890).” https://ips-u-hyogo.jp/publishing/pdf/discussionpaper/DPNo89.pdf. Accessed March 9, 2022.
McKinley, Vicky. 2010. “Age Progression in the Harry Potter Series: Reading Levels, Complexity of Ideas, and Death, or When Should My Child Start Reading Harry Potter?” In Terminus, Collected Papers on Harry Potter, 7–11 August 2008, edited by Goetz, Sharon K. Sedalia: Narrate Conferences, pp. 377–402.
Minowitz, Peter. 1989. “Invisible Hand, Invisible Death: Adam Smith on War and Socio-Economic Development.” Journal of Political & Military Sociology 17 (2): 305–315.
Montes, Leonidas. 2003. “Das Adam Smith Problem: Its Origins, the Stages of the Current Debate, and One Implication for Our Understanding of Sympathy.” Journal of the History of Economic Thought 25 (1): 63–90.
Montes, Leonidas. 2004. “Das Adam Smith Problem: Its Origins and the Debate.” In Montes, Leonidas, Adam Smith in Context. Cham: Springer, pp. 1556.CrossRefGoogle Scholar
Paganelli, Maria Pia. 2011. “Theory of Moral Sentiments 1759 vs Theory of Moral Sentiments 1790: A Change of Mind or a Change of Constraints?Studi e Note Di Economia 16 (2): 123132.Google Scholar
Paganelli, Maria Pia, and Schumacher, Reinhard. 2019. “Do Not Take Peace for Granted: Adam Smith’s Warning on the Relation between Commerce and War.” Cambridge Journal of Economics 43 (3): 785–797.
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing. https://www.R-project.org/. Accessed September 27, 2022.
Rinker, Tyler W. 2021. sentimentr: Calculate Text Polarity Sentiment. Version 2.9.0. https://github.com/trinker/sentimentr. Accessed March 9, 2022.
Shimodaira, Hiroyuki. 2019. “Application of Text Mining Methods to the History of Economic Thought: An Overview.” History of Economic Thought 61 (1): 104–123. [In Japanese]
Shimodaira, Hiroyuki, and Fukuda, Shinji. 2014. “Popularization of Classical Economics: The Text-Mining Analysis of David Ricardo, James Mill, and Harriet Martineau.” Working paper. https://www-hs.yamagata-u.ac.jp/wp-content/uploads/2017/09/82482bc7496b7b36c0679f8ccad14bff.pdf. Accessed March 9, 2022.
Silge, Julia, and Robinson, David. 2017. Text Mining with R: A Tidy Approach. Sebastopol: O’Reilly Media, Inc.
Smith, Adam. 1759 [2002]. The Theory of Moral Sentiments. Edited by Haakonssen, K. Cambridge: Cambridge University Press.
Smith, Adam. 1763. Lectures on Jurisprudence. Glasgow Edition of the Works and Correspondence of Adam Smith. Edited by Meek, R. L., Raphael, D. D., and Stein, P. G. Oxford: Oxford University Press.
Smith, Adam. 1776 [1977]. An Inquiry into the Nature and Causes of the Wealth of Nations. Edited by Cannan, Edwin. Chicago: University of Chicago Press.
Smith, Joseph A., and Kelly, Coleen. 2002. “Stylistic Constancy and Change across Literary Corpora: Using Measures of Lexical Richness to Date Works.” Computers and the Humanities 36 (4): 411–430.
Stamatatos, Efstathios. 2009. “A Survey of Modern Authorship Attribution Methods.” Journal of the American Society for Information Science and Technology 60 (3): 538–556.
Svorenčík, Andrej. 2018. “The Missing Link: Prosopography in the History of Economics.” History of Political Economy 50 (3): 605–613.
Trincado, Estrella, and Vindel, José María. 2015. “An Application of Econophysics to the History of Economic Thought: The Analysis of Texts from the Frequency of Appearance of Key Words.” Economics Discussion Papers 51. Kiel Institute for the World Economy.
Weinstein, Jack Russell. n.d. “Adam Smith (1723–1790).” Internet Encyclopedia of Philosophy. https://iep.utm.edu/smith/. Accessed October 4, 2021.
Welbers, Kasper, Van Atteveldt, Wouter, and Benoit, Kenneth. 2017. “Text Analysis in R.” Communication Methods and Measures 11 (4): 245–265.
Wright, Claire. 2016. “The 1920s Viennese Intellectual Community as a Center for Ideas Exchange: A Network Analysis.” History of Political Economy 48 (4): 593–634.
Yoon, Hye-Joon. 2017. The Rhetoric of Tenses in Adam Smith’s The Wealth of Nations. Boston: Brill.
Table 1. Number of Tokens, Types, Sentences, and Pages for Each Book

Table 2. Most Characteristic Words for Each Book, According to Different Indicators

Figure 1. A Network Graph of the Fifty Strongest Co-occurrences in TMS

Figure 2. A Network Graph of the Fifty Strongest Co-occurrences in WN

Figure 3. MATTR Values throughout TMS and WN

Figure 4. Flesch-Kincaid Grade Level Scores throughout TMS and WN

Figure 5. Boxplots of the Distribution of the Lengths of Sentences and Words in TMS and WN

Figure 6. Relative Frequencies (%) of Personal Pronouns in TMS and WN

Figure 7. Left: Sentiment Scores for All Sentences throughout TMS and WN; Right: Boxplots of the Distribution of Sentiment Values in TMS and WN

Supplementary material: Ballandonne and Cersosimo, Online Appendix (PDF).