

Gee Y Lee, Scott Manski and Tapabrata Maiti


In insurance analytics, textual descriptions of claims are often discarded because traditional empirical analyses require numeric descriptor variables. This paper demonstrates how textual data can be readily used in insurance analytics. Using the concept of word similarities, we illustrate how to extract variables from text and incorporate them into claims analyses using a standard generalized linear model or a generalized additive regression model. The procedure is applied to the Wisconsin Local Government Property Insurance Fund (LGPIF) data to demonstrate how insurance claims management and risk mitigation procedures can be improved. We illustrate two applications. First, we show how the claims classification problem can be solved using textual information. Second, we analyze the relationship between risk metrics and the probability of large losses. We obtain good results for both applications, in which short textual descriptions of insurance claims are used to extract features.
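As a rough illustration of the idea of turning word similarities into numeric variables, the sketch below scores each claim description against a few keywords using cosine similarity between word vectors. Everything in it is an assumption made for illustration: the toy three-dimensional embeddings, the keyword list, the example descriptions, and the choice of the maximum similarity as the feature are hypothetical and are not taken from the paper, whose actual feature construction is not described in this abstract.

```python
import numpy as np

# Hypothetical toy word vectors. In practice these would come from
# pretrained embeddings; the values here are made up for illustration.
TOY_EMBEDDINGS = {
    "fire": np.array([1.0, 0.0, 0.0]),
    "smoke": np.array([0.9, 0.1, 0.0]),
    "water": np.array([0.0, 1.0, 0.0]),
    "flood": np.array([0.1, 0.9, 0.0]),
    "vehicle": np.array([0.0, 0.0, 1.0]),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def similarity_features(description, keywords, embeddings):
    """One feature per keyword: the largest cosine similarity between the
    keyword and any word of the description that has an embedding.
    Descriptions with no known words get a row of zeros."""
    tokens = [t for t in description.lower().split() if t in embeddings]
    features = []
    for keyword in keywords:
        sims = [cosine_similarity(embeddings[keyword], embeddings[t])
                for t in tokens]
        features.append(max(sims) if sims else 0.0)
    return np.array(features)


# Hypothetical claim descriptions and keywords.
claims = ["Smoke damage in boiler room", "Flood in basement"]
keywords = ["fire", "water"]

# Numeric design matrix: one row per claim, one column per keyword.
X = np.vstack([similarity_features(c, keywords, TOY_EMBEDDINGS)
               for c in claims])
```

Each column of X is an ordinary numeric covariate, so a matrix of this kind could be supplied to a generalized linear model or a generalized additive model alongside other rating variables.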








