Syntactic methods for topic-independent authorship attribution

JOHANNA BJÖRKLUND; NIKLAS ZECHNER

doi:10.1017/S1351324917000249

Published online by Cambridge University Press: 09 August 2017

Affiliation (both authors): Department of Computer Science, Umeå University, 901 87 Umeå, Sweden
E-mails: johanna@cs.umu.se, zechner@cs.umu.se

Abstract

The efficacy of syntactic features for topic-independent authorship attribution is evaluated, taking a feature set of frequencies of words and punctuation marks as baseline. The features are ‘deep’ in the sense that they are derived by parsing the subject texts, in contrast to ‘shallow’ syntactic features for which a part-of-speech analysis is enough. The experiments are made on two corpora of online texts and one corpus of novels written around the year 1900. The classification tasks include classical closed-world authorship attribution, identification of separate texts among the works of one author, and cross-topic authorship attribution. In the first tasks, the feature sets were fairly evenly matched, but for the last task, the syntax-based feature set outperformed the baseline feature set. These results suggest that, compared to lexical features, syntactic features are more robust to changes in topic.
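
As a rough illustration of what a baseline of word and punctuation frequencies might look like, the following minimal sketch computes relative frequencies for a fixed list of items. It is not the authors' implementation: the tokenizer pattern, the function name `relative_frequencies`, and the example vocabulary are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokenizer (not from the article): a token is either a run of
# letters/apostrophes or a single punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z']+|[^\w\s]")


def relative_frequencies(text, vocabulary):
    """Return the relative frequency of each vocabulary item in text.

    Each value is the item's count divided by the total number of tokens.
    """
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[item] / total for item in vocabulary]


# Hypothetical vocabulary of frequent words and punctuation marks.
vocab = ["the", "and", ",", "."]
features = relative_frequencies("The cat sat, and the dog ran.", vocab)
```

A vector like `features` could then be passed to any standard classifier. The syntactic features evaluated in the article would require a parser and are not sketched here.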

Type: Articles
Information: Natural Language Engineering, Volume 23, Issue 5, September 2017, pp. 789–806

DOI: https://doi.org/10.1017/S1351324917000249
Copyright: © Cambridge University Press 2017
