
Multi-modal interaction with transformers: bridging robots and human with natural language

Published online by Cambridge University Press:  13 November 2023

Shaochen Wang
Affiliation:
Department of Automation, University of Science and Technology of China, Hefei, 230026, China
Zhangli Zhou
Affiliation:
Department of Automation, University of Science and Technology of China, Hefei, 230026, China
Bin Li
Affiliation:
Department of Electronic and Information Science, University of Science and Technology of China, Hefei, 230026, China
Zhijun Li
Affiliation:
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, 230031, China School of Mechanical Engineering, Tongji University, Shanghai, 200092, China
Zhen Kan*
Affiliation:
Department of Automation, University of Science and Technology of China, Hefei, 230026, China
*
Corresponding author: Zhen Kan; Email: zkan@ustc.edu.cn

Abstract

Language-guided visual robotic grasping enables robots to grasp objects according to human language instructions. However, real-world human-robot collaboration often involves ambiguous instructions and complex scenes, which raise three challenges: understanding the linguistic query, discriminating the key concepts across visual and language modalities, and generating executable grasp configurations for the robot's end-effector. To address these challenges, we propose a novel multi-modal transformer-based framework that localizes the spatial relations of objects from text queries and visual sensing, allowing robots to grasp objects in accordance with human instructions. The framework consists of two main components. First, a visual-linguistic transformer encoder models multi-modal interactions for the objects referred to in the text. Second, the framework jointly performs spatial localization and grasp prediction. Extensive ablation studies on multiple datasets evaluate the contribution of each component, and physical experiments with natural language-driven human-robot interaction on a real robot validate the practicality of our approach.
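The two components described above can be illustrated with a minimal sketch. This is not the authors' code: the single-head cross-attention, the shapes, and all names (`cross_attention`, `joint_heads`, the pooled box/grasp heads) are illustrative assumptions showing how text tokens might attend over visual features before a shared representation feeds both a localization head and a grasp head.

```python
# Hedged sketch (assumed design, not the paper's implementation):
# (1) a visual-linguistic encoder step where text tokens attend over
#     image-patch features, and
# (2) joint heads predicting a bounding box (x, y, w, h) and a grasp
#     rectangle (x, y, w, h, theta) from the fused representation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, visual_tokens):
    """Text queries attend over visual keys/values (one head, no projections)."""
    scores = text_tokens @ visual_tokens.T / np.sqrt(text_tokens.shape[-1])
    return softmax(scores) @ visual_tokens          # (T, d) fused features

def joint_heads(fused, w_box, w_grasp):
    """Pool the fused features, then predict box and grasp parameters
    from the same shared representation (joint localization + grasping)."""
    pooled = fused.mean(axis=0)                     # (d,)
    return pooled @ w_box, pooled @ w_grasp

rng = np.random.default_rng(0)
d = 16
visual = rng.normal(size=(49, d))                   # e.g. 7x7 grid of patch features
text = rng.normal(size=(5, d))                      # e.g. 5 word embeddings
fused = cross_attention(text, visual)
box, grasp = joint_heads(fused,
                         rng.normal(size=(d, 4)),   # box head: x, y, w, h
                         rng.normal(size=(d, 5)))   # grasp head: x, y, w, h, theta
print(box.shape, grasp.shape)
```

In the full model, a stack of such attention layers (with learned projections and feed-forward blocks) would replace the single unprojected head, but the data flow — fuse text with vision, then decode both a location and a grasp from the shared features — is the same.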

Type
Research Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press

