
Topic Classification for Political Texts with Pretrained Language Models

Published online by Cambridge University Press:  08 March 2023

Yu Wang*
University of Rochester, Rochester, NY, USA. E-mail: w.y@alum.urmc.rochester.edu
*Corresponding author.

Abstract

Supervised topic classification requires labeled data. This often becomes a bottleneck as high-quality labeled data are expensive to acquire. To overcome the data scarcity problem, scholars have recently proposed to use cross-domain topic classification to take advantage of preexisting labeled datasets. Cross-domain topic classification only requires limited annotation in the target domain to verify its cross-domain accuracy. In this letter, we propose supervised topic classification with pretrained language models as an alternative. We show that language models fine-tuned with 70% of the small annotated dataset in the target corpus could outperform models trained using large cross-domain datasets by 27% and that models fine-tuned with 10% of the annotated dataset could already outperform the cross-domain classifiers. Our models are competitive in terms of training time and inference time. Researchers interested in supervised learning with limited labeled data should find our results useful. Our code and data are publicly available.1

Type
Letter
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of the Society for Political Methodology

1 Introduction

Supervised topic classification requires labeled data for training. This often becomes a bottleneck as high-quality labeled data are expensive to acquire. One way to overcome data scarcity is cross-domain topic classification (Osnabrügge, Ash, and Morelli 2021), where researchers train a model on a source domain with large labeled datasets and make inferences on a target domain where labeling is limited. The method relies on two conditions: rich labeled data in the source domain and high similarity between the source and target corpora. To evaluate the accuracy of the cross-domain classifier, researchers only need to annotate a small dataset in the target domain.

With the advent of pretrained language models (Devlin et al. 2019), however, researchers no longer have to train models from scratch as is done in Osnabrügge et al. (2021). Instead, they can take existing pretrained language models and fine-tune the already well-trained parameters on specific downstream tasks. Given that language models are known to require relatively few training samples to yield good performance (Longpre, Wang, and DuBois 2020), the small annotated dataset in the target domain, which is required for validating cross-domain classifiers, might be sufficient to train an accurate in-domain classifier directly.

In this letter, we present topic classification with pretrained language models as an alternative solution to the data scarcity problem. We show that language models fine-tuned with a portion (70%) of the dataset in the target domain, originally annotated for cross-domain verification, can substantially outperform cross-domain topic classifiers, and that 300 training samples alone suffice for language models to match or surpass the performance of the cross-domain classifiers in Osnabrügge et al. (2021). We further show that fine-tuning these language models fits well within researchers' time budgets.

2 Methodology

Pretrained language models are state-of-the-art models in various natural language processing (NLP) tasks (Devlin et al. 2019; Lan et al. 2020; Liu et al. 2019). The heavy lifting is done during the pretraining stage, where large amounts of unlabeled text, for example, the English Wikipedia, are used to train multilayer transformer models (Vaswani et al. 2017) for masked language modeling and replaced token detection, among other objectives (Clark et al. 2020).Footnote 2 Fine-tuning these pretrained models has achieved state-of-the-art results on downstream tasks, including classification and question answering.
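
To make the masked language modeling objective concrete, the following minimal sketch (an illustration, not part of the replication code) queries a pretrained RoBERTa-base model through the Hugging Face transformers pipeline API; the example sentence is hypothetical.

```python
# A minimal sketch of the masked language modeling objective that RoBERTa is
# pretrained on, using the Hugging Face transformers library.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa uses "<mask>" as its mask token; the sentence below is illustrative.
for prediction in fill_mask("The parliament passed the <mask> today."):
    print(prediction["token_str"], round(prediction["score"], 3))
```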

Compared with other NLP models that must be trained from randomly initialized parameters, pretrained language models pack large amounts of knowledge into their parameters during the pretraining stage and thus require only a small labeled dataset for fine-tuning to achieve strong performance (Longpre et al. 2020). This fits well with topic classification for political texts, where labeling is expensive, and offers an alternative to cross-domain topic classification, which trains parameters from scratch using a labeled dataset from a different but similar domain.

For this letter, we fine-tune a RoBERTa-base model (Liu et al. 2019) on the target dataset from Osnabrügge et al. (2021) for topic classification.Footnote 3 RoBERTa-base has 12 layers of transformers and 125 million parameters in total. On top of the 12 transformer layers, we add a classification layer for 44-topic and 8-topic classification, respectively.Footnote 4 We use cross-entropy as the loss function. We fine-tune the model on an A100 GPU with a learning rate of 2e-5, a batch size of 16, and the maximum input sequence length of 512.Footnote 5 We use accuracy on the validation set to select the best epoch and the optimal checkpoint, which we then use to make inferences on the test set with a batch size of 64. For easy comparison, we use the same evaluation metrics as Osnabrügge et al. (2021).
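
As an illustration of this setup, the sketch below fine-tunes roberta-base with the hyperparameters listed above using the Hugging Face transformers Trainer. It assumes datasets objects train_ds, val_ds, and test_ds with "text" and "label" columns; the argument names follow the Trainer API, not the replication code.

```python
# A minimal sketch of the fine-tuning setup described above.
import numpy as np
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=44)  # footnote 4: 42 topics actually appear; cross-entropy is the default loss

def tokenize(batch):
    # Truncate/pad to the maximum sequence length of 512 tokens.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=512)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

args = TrainingArguments(
    output_dir="roberta-topic",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=20,
    evaluation_strategy="epoch",      # evaluate every epoch on the validation set
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the checkpoint with the best validation accuracy
    metric_for_best_model="accuracy",
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds.map(tokenize, batched=True),
                  eval_dataset=val_ds.map(tokenize, batched=True),
                  compute_metrics=accuracy)
trainer.train()

# Inference on the held-out test set with the selected checkpoint.
test_predictions = trainer.predict(test_ds.map(tokenize, batched=True))
```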

To construct the train, validation, and test sets, we use the 4,165 New Zealand parliamentary speeches in the target domain of Osnabrügge et al. (2021).Footnote 6 These speeches were originally labeled to verify the effectiveness of cross-domain topic classification. In this letter, we show that these labeled speeches alone are sufficient to train a competitive topic classifier by fine-tuning a pretrained language model. In our main experiment, we randomly sample 70% of the dataset as the training set, 15% as the validation set, and the remaining 15% as the test set. In total, 2,915 samples are used for training, 625 for validation, and 625 for testing. For reproducibility, we set a random seed for nondeterministic operations in the experiment (Zhang et al. 2021) and report results averaged over five random runs.
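
The following minimal sketch illustrates the split and the five-run protocol, assuming a pandas DataFrame speeches holding the 4,165 annotated speeches; the column layout and all seeds except 11 (the one reported for Table 2) are illustrative, not taken from the replication materials.

```python
# A minimal sketch of the 70/15/15 split and the multi-seed protocol.
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import set_seed  # fixes the Python, NumPy, and PyTorch seeds

def split_70_15_15(speeches: pd.DataFrame, seed: int):
    train, rest = train_test_split(speeches, test_size=0.30, random_state=seed)
    val, test = train_test_split(rest, test_size=0.50, random_state=seed)
    return train, val, test  # roughly 2,915 / 625 / 625 rows

run_metrics = []
for seed in (11, 22, 33, 44, 55):       # five random runs; seeds other than 11 are illustrative
    set_seed(seed)                      # seed nondeterministic operations
    train, val, test = split_70_15_15(speeches, seed)
    # ... fine-tune and evaluate as in the sketch above, then collect the metrics ...
    # run_metrics.append(evaluate(train, val, test))

# Report the mean and standard deviation across the five runs (as in Table 1).
# summary = pd.DataFrame(run_metrics).agg(["mean", "std"])
```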

3 Results

3.1 Main Experiment

In the main experiment, we use 2,915 (70%) samples for training, 625 (15%) for validation, and 625 (15%) for testing, and we repeat the experiment five times with five random seeds. Table 1 reports the mean and standard deviation for each metric. Across all metrics and both the 44-topic and 8-topic classification tasks, fine-tuning the RoBERTa model on a subset of the labeled New Zealand parliamentary speeches substantially outperforms the cross-domain topic classifier of Osnabrügge et al. (2021), which is trained on 115,420 annotated policy statements.

Table 1 Fine-tuning a RoBERTa-base model with 70% of labeled in-domain data outperforms cross-domain topic classification on both the 44-topic and 8-topic tasks by a large margin. Cross-domain classifiers are from Osnabrügge et al. (2021). The test set is the same for both models. The mean of five random runs is reported, with standard deviations in brackets. Better results are in bold.

Specifically, for 44-topic classification, our top-1 accuracy stands at 52.7%, which is 27.3% higher in relative terms (11.3 percentage points) than the cross-domain classifier's 41.4%. For 8-topic classification, our top-1 accuracy stands at 63.1%, 22.5% higher in relative terms (11.6 percentage points) than the cross-domain classifier's 51.5%. We see large gains in the other metrics as well: more than 10% in top-3 accuracy, more than 5% in top-5 accuracy, more than 16% in balanced accuracy, and more than 12% in F1 macro. These results suggest that it is feasible to train a competitive in-domain classifier with only a portion of the target corpus.
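
For readers who wish to reproduce these metrics, the sketch below computes top-k accuracy, balanced accuracy, and macro F1 with scikit-learn from the test-set logits returned by the Trainer in the earlier sketch; it is an illustration, not the replication code.

```python
# A minimal sketch of the evaluation metrics reported in Table 1.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score, top_k_accuracy_score

y_true = test_predictions.label_ids     # gold topic labels
logits = test_predictions.predictions   # shape (n_samples, n_topics)
y_pred = logits.argmax(axis=-1)
all_labels = np.arange(logits.shape[1])

metrics = {
    "top-1 accuracy": (y_pred == y_true).mean(),
    "top-3 accuracy": top_k_accuracy_score(y_true, logits, k=3, labels=all_labels),
    "top-5 accuracy": top_k_accuracy_score(y_true, logits, k=5, labels=all_labels),
    "balanced accuracy": balanced_accuracy_score(y_true, y_pred),
    "F1 macro": f1_score(y_true, y_pred, average="macro"),
}
print(metrics)
```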

3.2 Performance by Topic

In Table 2, we compare the per-topic accuracy on the test set of the fine-tuned RoBERTa models with that of the 44-topic cross-domain classifier (top) and the 8-topic cross-domain classifier (bottom), using one of the five random runs from the main experiment.Footnote 7 One immediate observation is that for 44-topic classification, the fine-tuned RoBERTa model performs better on larger topics. In this particular run, for topics with more than 10 samples in the test set, the fine-tuned RoBERTa model does better than or as well as the cross-domain classifier on all topics except "education" and "equality."

Table 2 Accuracy (recall) comparison by topic. Cross-domain classifiers are from Osnabrügge et al. (2021). The test set is the same for both models. N indicates sample size. The random seed is set to 11. Better results are in bold.

Our second observation is that the fine-tuned RoBERTa model’s advantage over the cross-domain classifier disappears on rare topics, such as “nationalization” and “underprivileged minority groups.” Because these topics are rare, the RoBERTa model did not see enough such samples during the training stage.Footnote 8 By contrast, the cross-domain classifier has seen considerably more such samples during its training stage with party manifestos. The cross-domain classifier thus has an advantage in predicting samples on rare topics correctly.Footnote 9

Our third observation is that for 8-topic classification, the fine-tuned RoBERTa model outperforms the cross-domain classifier for seven of the eight topics. This is not surprising given that with fewer topics, the number of samples in each topic will become larger, which in turn ensures that the RoBERTa model sees enough training samples for each topic during the fine-tuning stage. In the next subsection, we explore this question from a slightly different angle: what is the minimum number of samples that we need for the fine-tuned language model to outperform the cross-domain classifiers?
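
The per-topic accuracies (recalls) in Table 2 can be computed along the following lines; the sketch reuses the predictions from the earlier evaluation sketch and assumes a hypothetical topic_names list mapping label ids to topic labels.

```python
# A minimal sketch of the per-topic accuracy (recall) comparison in Table 2.
import numpy as np
import pandas as pd
from sklearn.metrics import recall_score

label_ids = np.arange(len(topic_names))            # topic_names is an assumed list
per_topic_recall = recall_score(y_true, y_pred, average=None, labels=label_ids)
support = pd.Series(y_true).value_counts().reindex(label_ids, fill_value=0)

table2 = pd.DataFrame({"topic": topic_names,
                       "N": support.values,
                       "recall": per_topic_recall}).sort_values("N", ascending=False)
print(table2.to_string(index=False))
```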

3.3 Number of Training Samples

In this experiment, we study how many training samples the fine-tuned language model requires to match the performance of the cross-domain classifier in Osnabrügge et al. (2021). The experiment is motivated by the observation that researchers often may not have access to an annotated target set with as many as the 2,915 training samples used in our main experiment. Will the fine-tuned language model remain competitive with a much smaller training set? We fine-tune the language model for 20 epochs with 200, 300, and 400 training samples, respectively, and split the remaining samples evenly into the validation set and the test set. We run each setting five times and report the mean top-1 accuracy plus and minus one standard deviation.Footnote 10 We report our results in Figure 1. For easy comparison, we also include the corresponding performance of the cross-domain classifier as reported in Osnabrügge et al. (2021).

Figure 1 Model performance increases as the training size increases. With 300 training examples, the fine-tuned RoBERTa model outperforms the cross-domain classifier (the dashed green line) on the 44-topic classification task (left) and on the 8-topic classification task (right).

We observe that with 300 training samples, the fine-tuned language model is able to outperform the cross-domain classifier on the 44-topic classification task (left) and the 8-topic classification task (right). This suggests that depending on task difficulty, researchers with a few hundred training samples may consider a fine-tuned language model as an effective option.
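
The sampling protocol behind this experiment (see footnote 10) can be sketched as follows; the DataFrame and function names are illustrative, not taken from the replication code.

```python
# A minimal sketch of the sampling protocol in footnote 10: draw 400 training
# samples once per run, keep the validation and test sets fixed, and downsample
# the training pool to 300 or 200.
from sklearn.model_selection import train_test_split

def training_size_splits(speeches, seed):
    train_pool, rest = train_test_split(speeches, train_size=400, random_state=seed)
    val, test = train_test_split(rest, test_size=0.50, random_state=seed)
    train_pool = train_pool.sample(frac=1.0, random_state=seed)  # shuffle once
    # Smaller training sets are prefixes of the shuffled pool, so the dev and
    # test sets are identical across the 200-, 300-, and 400-sample settings.
    return {n: (train_pool.iloc[:n], val, test) for n in (200, 300, 400)}
```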

3.4 Training and Inference Time

While language models are known to be slow given their large size, their training and inference times fit well within the time budget of most researchers. In terms of training, on a single A100 GPU with 40 GB of memory, it takes 27 minutes to train the model for 20 epochs over 2,915 samples.Footnote 11 Training time further decreases linearly as we use fewer training samples. To put that into perspective, training the cross-domain classifier with cross-validation in Osnabrügge et al. (2021) takes 27 minutes on an iMac with 16 CPUs, and generating a single OLS regression table on large datasets can take more than 20 minutes (Stone, Wang, and Yu 2022). Comparing GPU time with other models' CPU time is certainly not an apples-to-apples comparison, but from the researcher's point of view, the time required to fine-tune a language model is comparable to that of other research methods.

Compared with training, inference is significantly faster. With a batch size of 64, our model makes around 145 inferences per second on a single A100 GPU, which translates to 10,000 inferences in a little over a minute. With such a quick turnaround in training and inference, our method should fit into the time budget of most researchers.
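
A minimal sketch of this batched inference loop is given below; it reuses the fine-tuned model and tokenizer from the earlier sketch, test["text"] is an assumed column name, and the measured throughput will vary with hardware and sequence length.

```python
# A minimal sketch of batched test-set inference and a rough throughput estimate.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

texts = test["text"].tolist()
predictions, start = [], time.time()
with torch.no_grad():
    for i in range(0, len(texts), 64):                 # batch size of 64
        batch = tokenizer(texts[i:i + 64], truncation=True, padding=True,
                          max_length=512, return_tensors="pt").to(device)
        predictions.extend(model(**batch).logits.argmax(dim=-1).tolist())

elapsed = time.time() - start
print(f"{len(texts) / elapsed:.1f} inferences per second")
```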

4 Conclusion

Osnabrügge et al. (2021) recently proposed cross-domain supervised training to take advantage of existing labeled data and reduce data collection costs in classifying political texts. In this letter, we have proposed an alternative that builds on pretrained language models. We have shown that fine-tuning a pretrained language model requires only a small annotated dataset: with just a small portion (10%) of the annotated dataset originally used to evaluate the cross-domain classifier, a fine-tuned RoBERTa-base model can outperform the cross-domain classifier. We have also noted that for topics with few to no in-domain training samples, the advantage of the fine-tuned language model over cross-domain classifiers largely disappears. Lastly, we have shown that the fine-tuned models are competitive in terms of training and inference time. Future research could explore the broader application of pretrained language models, alongside cross-domain classifiers, to other research questions, such as populism prediction (Cocco and Monechi 2022), sentiment and stance analysis (Bestvater and Monroe 2022), and party position analysis (Herrmann and Döring 2021), as well as the optimization of pretrained language models in training and inference.

Acknowledgments

I thank the reviewers and the editor for their excellent comments and guidance, which substantially improved the paper and inspired new ideas for future work. It is no overstatement to say that they contributed to the paper as co-authors.

Data Availability Statement

The replication materials (Wang 2023) are available at https://doi.org/10.7910/DVN/FMT8KR.

Supplementary Material

For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2023.3.

Footnotes

Edited by Jeff Gill

1 The replication materials (Wang 2023) are available at the Political Analysis dataverse site.

2 Note that pretraining language models using generic texts and then fine-tuning them on more specific domains, such as political texts, is itself cross-domain transfer learning.

3 Other popular pretrained language models include BERT-base-uncased, BERT-large-uncased, and RoBERTa-large. We note that BERT-large and RoBERTa-large are substantially larger and thus slower compared with RoBERTa-base. We choose RoBERTa-base because it yields competitive accuracies and is reasonably fast.

4 There are actually 42 topics in the target corpus, so we set the number of labels to 42 for the language models. The performance difference between using 42 and 44 labels is minimal. To be consistent with Osnabrügge et al. (2021), we use “44-topic classification” throughout.

5 We report statistics on the input sequences and the effects of sequence length on model performance in the Supplementary Material.

6 The source data include all the English-language manifesto statements from English-speaking countries captured by the Manifesto Corpus. The target data include speeches delivered in the New Zealand Parliament and annotated by the Manifesto coder for New Zealand. For complete context on the datasets, please refer to Osnabrügge et al. (2021).

7 As a robustness check and to show inter-run variations, we report the results from another run in the Supplementary Material.

8 For instance, among the 4,165 annotated parliamentary speeches, there are only 10 samples that fall into the “underprivileged minority groups” topic.

9 Note that given the small sizes of some of the classes, the differences in accuracy for some classes, for example, “underprivileged minority groups,” are not statistically significant in a difference-in-proportions test (Wang, Li, and Luo 2016).

10 To ensure that the test set is the same in a run across different training sizes, we first sample 400 training samples and then downsample to 200 or 300 when necessary while keeping the dev set and test set the same.

11 There are various ways to further reduce training time, including freezing some of the 12 transformer layers, reducing the sequence length, using fp16 (mixed-precision) training, optimizing the training batch size, and distributing training across multiple GPUs. These are beyond the scope of this letter.
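
As an illustration (not part of the replication code), the sketch below shows two of these options, layer freezing and fp16 training, applied to the classifier from the earlier fine-tuning sketch.

```python
# A minimal sketch of two speed-ups mentioned in footnote 11: freezing the
# lower transformer layers and enabling mixed-precision (fp16) training.
from transformers import TrainingArguments

# Freeze the embeddings and the first six of RoBERTa-base's 12 layers.
for name, param in model.named_parameters():
    if name.startswith("roberta.embeddings") or any(
            name.startswith(f"roberta.encoder.layer.{i}.") for i in range(6)):
        param.requires_grad = False

fast_args = TrainingArguments(output_dir="roberta-topic-fast",
                              learning_rate=2e-5,
                              per_device_train_batch_size=16,
                              num_train_epochs=20,
                              fp16=True)  # mixed-precision training on the GPU
```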

References

Bestvater, S. E., and Monroe, B. L. 2022. “Sentiment Is Not Stance: Target-Aware Opinion Classification for Political Text Analysis.” Political Analysis.
Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. 2020. “ELECTRA: Pre-Training Text Encoders as Discriminators Rather than Generators.” In ICLR.
Cocco, J. D., and Monechi, B. 2022. “How Populist Are Parties? Measuring Degrees of Populism in Party Manifestos Using Supervised Machine Learning.” Political Analysis 30 (3): 311–327. https://doi.org/10.1017/pan.2021.29
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of NAACL-HLT, 4171–4186.
Herrmann, M., and Döring, H. 2021. “Party Positions from Wikipedia Classifications of Party Ideology.” Political Analysis 31: 22–41. https://doi.org/10.1017/pan.2021.28
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. 2020. “ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations.” In ICLR.
Liu, Y., et al. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” Preprint, arXiv:1907.11692.
Longpre, S., Wang, Y., and DuBois, C. 2020. “How Effective Is Task-Agnostic Data Augmentation for Pretrained Transformers?” In Findings of the Association for Computational Linguistics: EMNLP 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.394
Osnabrügge, M., Ash, E., and Morelli, M. 2021. “Cross-Domain Topic Classification for Political Texts.” Political Analysis 31: 59–80. https://doi.org/10.1017/pan.2021.37
Stone, R., Wang, Y., and Yu, S. 2022. “Chinese Power and the State-Owned Enterprise.” International Organization 76 (1): 229–250. https://doi.org/10.1017/S0020818321000308
Vaswani, A., et al. 2017. “Attention Is All You Need.” In 31st Conference on Neural Information Processing Systems.
Wang, Y. 2023. “Replication Data for: Topic Classification for Political Texts with Pretrained Language Models.” Harvard Dataverse, V1. https://doi.org/10.7910/DVN/FMT8KR
Wang, Y., Li, Y., and Luo, J. 2016. “Deciphering the 2016 U.S. Presidential Campaign in the Twitter Sphere: A Comparison of the Trumpists and Clintonists.” In Proceedings of the Tenth International AAAI Conference on Web and Social Media.
Zhang, T., Wu, F., Katiyar, A., Weinberger, K. Q., and Artzi, Y. 2021. “Revisiting Few-Sample BERT Fine-Tuning.” In ICLR.
