
Viterbi training in PRISM

Published online by Cambridge University Press:  28 January 2014

TAISUKE SATO
Affiliation:
Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro, Tokyo, Japan (e-mail: sato@mi.cs.titech.ac.jp)
KEIICHI KUBOTA
Affiliation:
Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro, Tokyo, Japan (e-mail: kubota@mi.cs.titech.ac.jp)

Abstract

VT (Viterbi training), or hard expectation maximization (EM), is an efficient way of learning parameters for probabilistic models with hidden variables. Given an observation y, it searches for a state x of the hidden variables that maximizes p(x,y | θ) by coordinate ascent on x and the parameters θ. In this paper we introduce VT to PRISM (PRogramming In Statistical Modeling), a logic-based probabilistic modeling system for generative models. VT improves PRISM in three ways. First, VT in PRISM converges faster than EM in PRISM owing to VT's termination condition. Second, parameters learned by VT often yield better prediction performance than those learned by EM. We conducted two parsing experiments with probabilistic grammars, learning parameters by a variety of inference methods, i.e. VT, EM, MAP and VB; VT achieved the best parsing accuracy in both experiments. We also conducted a similar experiment on classification tasks, where, unlike probabilistic grammars, the hidden variable is not the prediction target, and found that in such a case VT does not necessarily yield superior performance. Third, since VT always deals with the single probability of a single explanation, the Viterbi explanation, the exclusiveness condition imposed on PRISM programs is no longer required when parameters are learned by VT. Last but not least, because VT in PRISM is general and applicable to any PRISM program, it largely eliminates the need for the user to develop a model-specific VT algorithm. Furthermore, since VT in PRISM can be used just by setting a PRISM flag appropriately, it makes VT easily accessible to (probabilistic) logic programmers.
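Concretely, the coordinate ascent the abstract describes alternates two maximizations until the Viterbi explanation stops changing:

    x^(t+1) = argmax_x p(x, y | θ^(t))         (re-select the Viterbi explanation)
    θ^(t+1) = argmax_θ p(x^(t+1), y | θ)       (refit parameters from the switch outcomes in that explanation)

As a rough illustration of what this looks like on the user's side, below is a minimal sketch of a two-state hidden Markov model in PRISM. The built-ins values/2, msw/2, set_prism_flag/2 and learn/1 are standard PRISM; the learn_mode flag value ml_vt follows the PRISM 2.x manual, but the exact flag name and value should be treated as an assumption here rather than as confirmed by this abstract.

    %% Minimal sketch (flag value assumed): a two-state HMM over {a,b}.
    values(init, [s0, s1]).       % choice of initial state
    values(tr(_), [s0, s1]).      % state transition out of each state
    values(out(_), [a, b]).       % symbol emission in each state

    hmm(Cs) :- msw(init, S), hmm(1, 5, S, Cs).   % strings of length 5
    hmm(T, N, _, []) :- T > N.
    hmm(T, N, S, [C|Cs]) :-
        T =< N,
        msw(out(S), C),           % emit a symbol in state S
        msw(tr(S), Next),         % choose the next state
        T1 is T + 1,
        hmm(T1, N, Next, Cs).

    %% Assumed usage: switch learning from EM to VT via a flag, then learn
    %% from a list of observed goals.
    %% ?- set_prism_flag(learn_mode, ml_vt),
    %%    learn([hmm([a,b,a,a,b]), hmm([b,b,a,b,a])]).

The point of the sketch is that the model itself is unchanged; only the flag selects VT in place of EM, which is what makes VT applicable to any PRISM program without model-specific code.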

Type
Regular Papers
Copyright
Copyright © Cambridge University Press 2014 

