
Ranking facts for explaining answers to elementary science questions

Published online by Cambridge University Press: 24 January 2022

Jennifer D’Souza*
Affiliation:
TIB Leibniz Information Centre for Science and Technology, Hannover, Germany
Isaiah Onando Mulang
Affiliation:
University of Bonn, Bonn, Germany
Sören Auer
Affiliation:
TIB Leibniz Information Centre for Science and Technology, Hannover, Germany
*Corresponding author. E-mail: jennifer.dsouza@tib.eu

Abstract

In multiple-choice exams, students select one answer from (typically) four choices and can explain why they made that particular choice. Students are good at understanding natural language questions and, based on their domain knowledge, can easily infer the question’s answer by “connecting the dots” across various pertinent facts. Considering automated reasoning for elementary science question answering, we address the novel task of generating explanations for answers from human-authored facts. For this, we examine the practically scalable framework of feature-rich support vector machines leveraging domain-targeted, hand-crafted features. Explanations are created from a human-annotated set of nearly 5000 candidate facts in the WorldTree corpus. Our aim is to better match the valid facts of an explanation for a question’s correct answer against the available fact candidates. To this end, our features offer a comprehensive linguistic and semantic unification paradigm. The machine learning problem is the preference ordering of facts, for which we test pointwise regression versus pairwise learning-to-rank. Our contributions, originating from comprehensive evaluations against nine existing systems, are: (1) a case study in which two preference ordering approaches are systematically compared, and in which the pointwise approach is shown to outperform the pairwise approach, adding to the existing body of observations on this topic; (2) an improvement of 3.5 and 4.9 points over a highly effective TF-IDF-based IR technique on the development and test sets, respectively, demonstrating further room for improvement on this task (e.g., through an efficient learning algorithm and semantic features); (3) a practically competent approach that can outperform some variants of BERT-based reranking models; and (4) an interpretable machine learning model for the task, owing to its human-engineered features.
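The abstract compares the proposed ranker against a TF-IDF-based IR baseline. As a minimal sketch of how such a baseline scores candidate explanation facts (not the evaluated system's exact setup; the question and facts below are invented for illustration), one can rank facts by cosine similarity between TF-IDF vectors of the question text and each fact:

```python
# Minimal TF-IDF ranking sketch using scikit-learn; an assumed illustration,
# not the baseline implementation evaluated in the article.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "Which of these would let the most heat travel through? a metal spoon"
facts = [
    "metal is a thermal conductor",
    "a spoon is a kind of utensil",
    "heat means thermal energy",
]

vectorizer = TfidfVectorizer().fit(facts + [question])
fact_vecs = vectorizer.transform(facts)
q_vec = vectorizer.transform([question])

# Score each candidate fact by cosine similarity to the question and rank descending.
scores = cosine_similarity(q_vec, fact_vecs).ravel()
ranked = scores.argsort()[::-1]
print([facts[i] for i in ranked])
```

The abstract also contrasts pointwise regression with pairwise learning-to-rank as formulations of the preference ordering problem. The sketch below illustrates the difference between the two formulations on toy feature vectors; it is an assumption for exposition, using scikit-learn's LinearSVR and LinearSVC as stand-ins for the SVM tooling and hand-crafted features described in the article, with all data and variable names invented:

```python
# Sketch contrasting pointwise regression with pairwise learning-to-rank for
# ordering candidate explanation facts (toy data, not the authors' code).
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVR, LinearSVC

rng = np.random.default_rng(0)

# Toy setup: 20 candidate facts for one question, 10-dimensional feature
# vectors, graded relevance labels (higher = more central to the explanation).
X = rng.normal(size=(20, 10))
y = rng.integers(0, 3, size=20).astype(float)

# Pointwise approach: regress each fact's relevance score directly,
# then rank facts by the predicted score.
pointwise = LinearSVR(C=1.0, max_iter=10000).fit(X, y)
pointwise_scores = pointwise.predict(X)

# Pairwise approach: learn from preference pairs. Build difference vectors
# x_i - x_j labelled +1/-1 according to which fact has the higher gold
# relevance, and train a linear classifier on those differences.
diffs, prefs = [], []
for i, j in combinations(range(len(y)), 2):
    if y[i] == y[j]:
        continue
    diffs.append(X[i] - X[j])
    prefs.append(1 if y[i] > y[j] else -1)
pairwise = LinearSVC(C=1.0, max_iter=10000).fit(np.array(diffs), np.array(prefs))

# The learned weight vector induces a scoring function; ranking facts by w.x
# recovers a preference ordering.
pairwise_scores = X @ pairwise.coef_.ravel()

pointwise_ranking = np.argsort(-pointwise_scores)
pairwise_ranking = np.argsort(-pairwise_scores)
print("pointwise order:", pointwise_ranking[:5])
print("pairwise order:", pairwise_ranking[:5])
```

The key design difference is that the pointwise model sees absolute relevance targets per fact, whereas the pairwise model only sees relative preferences between facts for the same question; the case study reported in the article probes exactly this distinction.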

Type: Article
Copyright: © The Author(s), 2022. Published by Cambridge University Press
