MUSED: A multimedia multi-document dataset for topic segmentation

PEDRO MOTA; MAXINE ESKENAZI; LUÍSA COHEUR

doi:10.1017/S1351324918000359

Published online by Cambridge University Press: 22 October 2018

PEDRO MOTA: Affiliation:
Instituto Superior Técnico, Carnegie Mellon University, Rua Alves Redol 9, Lisbon 1000-029, Portugal e-mail: pedrom@andrew.cmu.edu
MAXINE ESKENAZI: Affiliation:
Carnegie Mellon University, 6413 Gates Hillman Complex, 5000 Forbes Ave, Pittsburgh, PA 15213, USA e-mail: max@cs.cmu.edu
LUÍSA COHEUR: Affiliation:
Instituto Superior Técnico, Rua Alves Redol 9, Lisbon 1000-029, Portugal e-mail: luisa.coheur@inesc-id.pt

Abstract

Research on topic segmentation has recently focused on segmenting documents by taking advantage of other documents that cover the same topics. To evaluate such approaches properly, a dataset of related documents is needed. However, existing datasets are limited in the number of related documents per domain. In addition, most of the available datasets do not consider documents from different media sources (PowerPoints, videos, etc.), which pose specific challenges to segmentation. We fill this gap with the MUltimedia SEgmentation Dataset (MUSED), a collection of manually segmented documents from different media sources, in seven different domains, with an average of twenty related documents per domain. In this paper, we describe the process of building MUSED. A multi-annotator study is carried out to determine whether human judges agree and to characterize their disagreement patterns. In addition, we use MUSED to compare state-of-the-art topic segmentation techniques, including the ones that take advantage of related documents. Moreover, we study the impact of having documents from different media sources in the dataset. To the best of our knowledge, MUSED is the first dataset that allows a straightforward evaluation of both single- and multi-document topic segmentation techniques, as well as a study of how these behave in the presence of documents from different media sources. Results show that some techniques are, indeed, sensitive to different media sources, and also that current multi-document segmentation models do not outperform previous models, pointing to a line of research that needs further work.
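The abstract does not say which measure is used to compare segmenters against the manual segmentations. As an illustration only, the sketch below scores a hypothesized segmentation against a reference one with WindowDiff, a widely used segmentation error measure. The function name `window_diff`, the 0/1 boundary encoding, and the example data are assumptions made for this sketch and are not taken from MUSED.

```python
from typing import Sequence


def window_diff(reference: Sequence[int], hypothesis: Sequence[int], k: int) -> float:
    """Return the WindowDiff error between two segmentations (0.0 is a perfect match).

    Each segmentation is a sequence of 0/1 flags, one per text unit
    (for example, a sentence), where 1 marks a topic boundary after that unit.
    This encoding is an assumption of the sketch.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same number of units")
    n = len(reference)
    if not 0 < k <= n:
        raise ValueError("window size k must satisfy 0 < k <= number of units")

    windows = n - k + 1
    # A window counts as an error when the two segmentations
    # disagree on how many boundaries fall inside it.
    errors = sum(
        1
        for i in range(windows)
        if sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
    )
    return errors / windows


# Hypothetical example: the first boundary is placed one unit too early.
reference = [0, 0, 1, 0, 0, 1, 0, 0]
hypothesis = [0, 1, 0, 0, 0, 1, 0, 0]
score = window_diff(reference, hypothesis, k=2)
print(f"WindowDiff: {score:.3f}")
```

The window size `k` is commonly set to about half the average reference segment length; here it is passed explicitly so that the choice stays visible. A near miss, as in the example, is penalized less than a missing or spurious boundary far from any reference boundary, which is why window-based measures are often preferred over exact boundary matching.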

Type: Article
Information: Natural Language Engineering, Volume 24, Issue 6, November 2018, pp. 921–946

DOI: https://doi.org/10.1017/S1351324918000359
Copyright: © Cambridge University Press 2018

Footnotes

*This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013, also under projects LAW-TRAIN (H2020-EU.3.7, contract 653587), and through the Carnegie Mellon Portugal Program under Grant SFRH/BD/51917/2012.
