
Poisson mixtures

Published online by Cambridge University Press: 12 September 2008

Kenneth W. Church
Affiliation:
AT&T Bell Laboratories, Murray Hill, NJ 07974, USA. e-mail: kwc@research.att.com
William A. Gale
Affiliation:
AT&T Bell Laboratories, Murray Hill, NJ 07974, USA. e-mail: gale@research.att.com

Abstract

Shannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and ngrams in text. It has become standard practice in text compression, speech recognition, information retrieval and many other applications of Shannon's theory to introduce a "bag-of-words" assumption. But obviously, word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ, which is intended to capture dependencies on hidden variables such as genre, author, topic, etc. (The Negative Binomial is a well-known special case where φ is a Γ distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (σ²), entropy (H), inverse document frequency (IDF), and adaptation (Pr(x ≥ 2 | x ≥ 1)).
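
As a concrete illustration of the abstract's point (this sketch is not taken from the paper itself), the following Python code fits both a standard Poisson and the Negative Binomial special case to a hypothetical vector of per-document counts for a single word, using method-of-moments estimates. The counts and the fitting procedure are assumptions chosen for illustration only; the paper's own estimation details may differ.

    # A minimal sketch, assuming illustrative per-document counts for one word.
    # Contrasts a standard Poisson with the Negative Binomial, the Gamma-mixture
    # special case mentioned in the abstract.
    import math

    counts = [0, 0, 0, 1, 0, 3, 0, 0, 7, 0, 1, 0, 0, 2, 0, 0, 0, 5, 0, 0]  # hypothetical data

    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n

    # Standard Poisson: Pr(X = 0) = exp(-mean); the variance is forced to equal the mean.
    p0_poisson = math.exp(-mean)
    p1_poisson = mean * math.exp(-mean)

    # Negative Binomial via method of moments: theta ~ Gamma(shape=k, scale=beta),
    # so var = mean * (1 + beta). Requires var > mean (overdispersion).
    beta = (var - mean) / mean
    k = mean / beta
    p0_nb = (1.0 + beta) ** (-k)
    p1_nb = k * beta * (1.0 + beta) ** (-k - 1.0)

    def idf(p0):
        # Inverse document frequency: -log2 Pr(X >= 1)
        return -math.log2(1.0 - p0)

    def adaptation(p0, p1):
        # Pr(X >= 2 | X >= 1): chance of a repeat occurrence given one occurrence
        return (1.0 - p0 - p1) / (1.0 - p0)

    print(f"mean={mean:.2f} var={var:.2f}  (var >> mean indicates a bursty word)")
    print(f"Poisson: IDF={idf(p0_poisson):.2f}  adaptation={adaptation(p0_poisson, p1_poisson):.2f}")
    print(f"NegBin:  IDF={idf(p0_nb):.2f}  adaptation={adaptation(p0_nb, p1_nb):.2f}")

On bursty counts like these, where the variance is well above the mean, the mixture puts more mass on zero occurrences (raising IDF) while also predicting a higher chance of repetition given a first occurrence, consistent with the adaptation probability described in the abstract.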

Type
Articles
Copyright
Copyright © Cambridge University Press 1995


References

Bell, T., Cleary, J., and Witten, I. (1990) Text Compression. Prentice Hall.
Bookstein, A. (1982) Explanation and generalization of vector models in information retrieval. In Conference on Research and Development in Information Retrieval (SIGIR). Pp. 118–132.
Bookstein, A., and Swanson, D. (1974) Probabilistic models for automatic indexing. Journal of the American Society for Information Science 25(5): 312–318.
Cover, T., and Thomas, J. (1991) Elements of Information Theory. John Wiley.
Francis, W., and Kucera, H. (1982) Frequency Analysis of English Usage. Houghton Mifflin Co.
Gale, W., Church, K., and Yarowsky, D. (1993) A method for disambiguating word senses in a large corpus. Computers and the Humanities, 415–439.
Harter, S. (1975) A probabilistic approach to automatic keyword indexing: Part I. On the distribution of specialty words in a technical literature. Journal of the American Society for Information Science 26(4): 197–206.
Johnson, N., and Kotz, S. (1969) Discrete Distributions. Houghton Mifflin Co.
Lau, R., Rosenfeld, R., and Roukos, S. (1993) Adaptive language modeling using the maximum entropy principle. In ARPA-sponsored workshop on Human Language Technology. Morgan Kaufmann. Pp. 108–113.
Mosteller, F., and Wallace, D. (1964) Inference and Disputed Authorship: The Federalist. Addison-Wesley.
Robertson, S., and Walker, S. (1994) Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR. Pp. 232–241.
Salton, G. (1989) Automatic Text Processing. Addison-Wesley.
Shannon, C. (1948) The mathematical theory of communication. Bell System Technical Journal 27: 379–423, 623–656.
Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(1): 11–21.
van Rijsbergen, C. (1979) Information Retrieval. Second Edition. Butterworth.
Yarowsky, D. (1992) Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In COLING-92.