Journal of East China Normal University (Natural Science), 2024, Vol. 2024, Issue (2): 65-75. doi: 10.3969/j.issn.1000-5641.2024.02.008

• Computer Science •

Dual-path network with multilevel interaction for one-stage visual grounding

Yue WANG, Jiabo YE, Xin LIN*

  1. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
  • Received: 2022-12-08  Online: 2024-03-25  Published: 2024-03-18
  • Contact: Xin LIN, E-mail: xlin@cs.ecnu.edu.cn

Abstract:

This study explores multimodal understanding and reasoning for one-stage visual grounding. Existing one-stage methods extract visual feature maps and textual features separately, and then perform multimodal reasoning to predict the bounding box of the referred object. These methods suffer from two weaknesses. First, the pre-trained visual feature extractor introduces text-unrelated visual signals into the visual features, which hinders multimodal interaction. Second, the reasoning process in these methods lacks visual guidance for language modeling. Both shortcomings limit the reasoning ability of existing one-stage methods. We propose a low-level interaction that extracts text-related visual feature maps, and a high-level interaction that incorporates visual features to guide language modeling and performs multistep reasoning on the visual features. Based on these interactions, we present a novel network architecture called the dual-path multilevel interaction network (DPMIN). Experiments on five commonly used visual grounding datasets demonstrate the superior performance of the proposed method and its real-time applicability.
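The abstract does not include code, so the following is a minimal, hypothetical PyTorch sketch of the two interactions it describes. All module names, dimensions, and fusion operators (channel-wise gating for the low-level interaction, repeated cross-attention with a GRU text update for the high-level reasoning steps) are illustrative assumptions, not the authors' DPMIN implementation.

```python
# Hypothetical sketch of the two interactions described in the abstract.
# Module names, dimensions, and fusion operators are illustrative
# assumptions, not the authors' DPMIN implementation.
import torch
import torch.nn as nn


class LowLevelInteraction(nn.Module):
    """Modulates visual feature maps with the text feature so that
    text-unrelated visual signals are suppressed (simple channel-wise
    gating, one plausible reading of 'text-related' filtering)."""

    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(txt_dim, vis_dim), nn.Sigmoid())

    def forward(self, vis_map: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis_map: (B, C, H, W); txt: (B, D)
        g = self.gate(txt).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return vis_map * g


class HighLevelInteraction(nn.Module):
    """Lets attended visual context guide the language representation,
    then repeats the step to perform multistep reasoning over the
    visual map (cross-attention, again an illustrative choice)."""

    def __init__(self, dim: int, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.txt_update = nn.GRUCell(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, vis_map: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        vis_seq = vis_map.flatten(2).transpose(1, 2)  # (B, H*W, C)
        for _ in range(self.steps):
            # Visual guidance for language modeling: refresh the text
            # state with the visual context it currently attends to.
            ctx, _ = self.attn(txt.unsqueeze(1), vis_seq, vis_seq)
            txt = self.txt_update(ctx.squeeze(1), txt)
        return txt  # refined query used to predict the bounding box


class DualPathGrounder(nn.Module):
    """Toy end-to-end wiring: low-level filtering, high-level reasoning,
    and a box head regressing a normalized (x, y, w, h)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.low = LowLevelInteraction(dim, dim)
        self.high = HighLevelInteraction(dim)
        self.box_head = nn.Linear(dim, 4)

    def forward(self, vis_map: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        vis_map = self.low(vis_map, txt)
        query = self.high(vis_map, txt)
        return self.box_head(query).sigmoid()


if __name__ == "__main__":
    model = DualPathGrounder()
    vis = torch.randn(2, 256, 20, 20)  # backbone feature map
    txt = torch.randn(2, 256)          # sentence embedding
    print(model(vis, txt).shape)       # torch.Size([2, 4])
```

In this one-stage framing, the text query directly regresses a single box from the filtered feature map, so no region proposals are needed; that is what keeps such methods real-time capable.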

Key words: visual grounding, multimodal understanding, referring expressions
