A Neuro-Symbolic ASP Pipeline for Visual Question Answering

Published online by Cambridge University Press: 11 July 2022

THOMAS EITER, NELSON HIGUERA, JOHANNES OETSCH and MICHAEL PRITZ
Affiliation:
Institute of Logic and Computation, Vienna University of Technology (TU Wien), Austria (e-mails: eiter@kr.tuwien.ac.at, higuera@kr.tuwien.ac.at, oetsch@kr.tuwien.ac.at, pritz@kr.tuwien.ac.at)

Abstract

We present a neuro-symbolic visual question answering (VQA) pipeline for CLEVR, a well-known dataset that consists of pictures showing scenes with objects and questions related to them. Our pipeline covers (i) training neural networks for object classification and bounding-box prediction of the CLEVR scenes, (ii) statistical analysis of the distribution of prediction values of the neural networks to determine a threshold for high-confidence predictions, and (iii) a translation of CLEVR questions and network predictions that pass confidence thresholds into logic programmes so that we can compute the answers using an answer-set programming solver. By exploiting choice rules, we consider deterministic and non-deterministic scene encodings. Our experiments show that, in comparison with the deterministic approach, the non-deterministic scene encoding achieves good results even if the neural networks are trained rather poorly. This is important for building robust VQA systems when network predictions are less than perfect. Furthermore, we show that restricting non-determinism to reasonable choices allows for more efficient implementations in comparison with related neuro-symbolic approaches without losing much accuracy.
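
As a rough illustration of steps (ii) and (iii), the following Python sketch shows how predictions that pass a confidence threshold might be turned into ASP facts or choice rules. It is not the authors' implementation: the predicate name `obj/2`, the function names, the "mean minus k standard deviations" threshold, and the fallback to the top prediction when nothing passes the threshold are all assumptions made for this example.

```python
from statistics import mean, pstdev


def confidence_threshold(scores, k=1.0):
    """Hypothetical threshold rule: mean minus k standard deviations.

    The abstract only says a threshold is derived from the distribution
    of prediction values; this particular formula is an assumption.
    """
    return max(0.0, min(1.0, mean(scores) - k * pstdev(scores)))


def encode_object(obj_id, predictions, threshold, deterministic=False):
    """Encode one detected object as an ASP fact or choice rule (as text).

    predictions: list of (label, confidence) pairs for this object.
    deterministic=True keeps only the top prediction (a plain fact).
    Otherwise every prediction at or above the threshold is kept and the
    solver chooses exactly one of them via a choice rule.
    """
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    if deterministic:
        kept = ranked[:1]
    else:
        # Assumed fallback: if nothing passes, keep the best guess.
        kept = [p for p in ranked if p[1] >= threshold] or ranked[:1]
    atoms = [f"obj({obj_id},{label})" for label, _ in kept]
    if len(atoms) == 1:
        return atoms[0] + "."
    return "1 { " + " ; ".join(atoms) + " } 1."


if __name__ == "__main__":
    # Made-up confidences for a single object, for illustration only.
    preds = [("red_cube", 0.55), ("red_cylinder", 0.40), ("blue_sphere", 0.05)]
    print(encode_object(0, preds, threshold=0.3, deterministic=True))
    print(encode_object(0, preds, threshold=0.3))
```

In the non-deterministic case the low-confidence label is dropped and the two remaining candidates end up in one choice rule, which reflects the idea of restricting non-determinism to reasonable choices. The resulting text could then be passed, together with an encoding of the question, to an answer-set solver.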

Type
Original Article
Copyright
© The Author(s), 2022. Published by Cambridge University Press

Footnotes

*

This work was partially funded by the Bosch Center for Artificial Intelligence at Renningen, Germany.
