
Multi-modal interaction with transformers: bridging robots and human with natural language

Published online by Cambridge University Press:  13 November 2023

Shaochen Wang
Affiliation:
Department of Automation, University of Science and Technology of China, Hefei, 230026, China
Zhangli Zhou
Affiliation:
Department of Automation, University of Science and Technology of China, Hefei, 230026, China
Bin Li
Affiliation:
Department of Electronic and Information Science, University of Science and Technology of China, Hefei, 230026, China
Zhijun Li
Affiliation:
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, 230031, China School of Mechanical Engineering, Tongji University, Shanghai, 200092, China
Zhen Kan*
Affiliation:
Department of Automation, University of Science and Technology of China, Hefei, 230026, China
*
Corresponding author: Zhen Kan; Email: zkan@ustc.edu.cn

Abstract

Language-guided visual robotic grasping enables robots to grasp objects according to human language instructions. However, real-world human-robot collaboration often involves ambiguous instructions and complex scenes, which raise three challenges: understanding the linguistic query, discriminating the key concepts across visual and language modalities, and generating executable grasp configurations for the robot's end-effector. To address these challenges, we propose a novel multi-modal transformer-based framework that localizes the spatial relations of objects from text queries and visual sensing, allowing robots to grasp objects in accordance with human instructions. The framework consists of two main components. First, a visual-linguistic transformer encoder models multi-modal interactions for the objects referred to in the text. Second, the framework jointly performs spatial localization and grasp prediction. Extensive ablation studies on multiple datasets evaluate the contribution of each component, and physical experiments with natural language-driven human-robot interaction on a real robot validate the practicality of our approach.
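The two components described above can be illustrated with a minimal sketch. This is not the authors' code: the single-head cross-attention, the shapes, and all names (`cross_attention`, `joint_heads`, the pooled box/grasp heads) are illustrative assumptions showing how text tokens might attend over visual features before a shared representation feeds both a localization head and a grasp head.

```python
# Hedged sketch (assumed design, not the paper's implementation):
# (1) a visual-linguistic encoder step where text tokens attend over
#     image-patch features, and
# (2) joint heads predicting a bounding box (x, y, w, h) and a grasp
#     rectangle (x, y, w, h, theta) from the fused representation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, visual_tokens):
    """Text queries attend over visual keys/values (one head, no projections)."""
    scores = text_tokens @ visual_tokens.T / np.sqrt(text_tokens.shape[-1])
    return softmax(scores) @ visual_tokens          # (T, d) fused features

def joint_heads(fused, w_box, w_grasp):
    """Pool the fused features, then predict box and grasp parameters
    from the same shared representation (joint localization + grasping)."""
    pooled = fused.mean(axis=0)                     # (d,)
    return pooled @ w_box, pooled @ w_grasp

rng = np.random.default_rng(0)
d = 16
visual = rng.normal(size=(49, d))                   # e.g. 7x7 grid of patch features
text = rng.normal(size=(5, d))                      # e.g. 5 word embeddings
fused = cross_attention(text, visual)
box, grasp = joint_heads(fused,
                         rng.normal(size=(d, 4)),   # box head: x, y, w, h
                         rng.normal(size=(d, 5)))   # grasp head: x, y, w, h, theta
print(box.shape, grasp.shape)
```

In the full model, a stack of such attention layers (with learned projections and feed-forward blocks) would replace the single unprojected head, but the data flow — fuse text with vision, then decode both a location and a grasp from the shared features — is the same.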

Type
Research Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press

