Journal of East China Normal University (Natural Science)
Dual-path network with multilevel interaction for one-stage visual grounding
Received date: 2022-12-08
Online published: 2024-03-18
This study explores multimodal understanding and reasoning for one-stage visual grounding. Existing one-stage methods extract visual feature maps and textual features separately and then perform multimodal reasoning to predict the bounding box of the referred object. These methods suffer from two weaknesses. First, the pre-trained visual feature extractors introduce text-unrelated visual signals into the visual features, which hinder multimodal interaction. Second, the reasoning process in these methods lacks visual guidance for language modeling. These shortcomings limit the reasoning ability of existing one-stage methods. We propose a low-level interaction that extracts text-related visual feature maps, and a high-level interaction that incorporates visual features to guide language modeling and further performs multistep reasoning on visual features. Based on the proposed interactions, we present a novel network architecture called the dual-path multilevel interaction network (DPMIN). Experiments on five commonly used visual grounding datasets demonstrate the superior performance of the proposed method and its real-time applicability.
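To make the two interactions concrete, the following PyTorch-style sketch gives one possible reading of the abstract: a low-level interaction that gates visual feature maps with a sentence embedding to suppress text-unrelated signals, and a high-level interaction in which word features attend to visual tokens before a query performs multistep reasoning to regress a box. All module names, dimensions, and fusion choices here are illustrative assumptions for exposition, not the DPMIN implementation described in the paper.

```python
# Illustrative sketch only: module names, dimensions, and the fusion scheme
# are assumptions made for exposition, not the paper's released code.
import torch
import torch.nn as nn


class LowLevelInteraction(nn.Module):
    """Gate visual feature maps with a sentence embedding so that
    text-unrelated visual signals are suppressed (channel-wise gating)."""

    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(txt_dim, vis_dim), nn.Sigmoid())

    def forward(self, vis_map: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # vis_map: (B, C, H, W); sent_emb: (B, D_txt)
        g = self.gate(sent_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return vis_map * g                                    # text-related feature map


class HighLevelInteraction(nn.Module):
    """Visual-guided language modeling followed by multistep reasoning:
    word features attend to visual tokens, then a pooled query repeatedly
    attends to the visual tokens to localize the referred object."""

    def __init__(self, dim: int, steps: int = 3, heads: int = 8):
        super().__init__()
        self.lang_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.reason_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.steps = steps
        self.head = nn.Linear(dim, 4)  # (cx, cy, w, h) box regression

    def forward(self, word_feats: torch.Tensor, vis_tokens: torch.Tensor) -> torch.Tensor:
        # word_feats: (B, L, D); vis_tokens: (B, HW, D)
        # 1) visual guidance for language modeling
        guided, _ = self.lang_attn(word_feats, vis_tokens, vis_tokens)
        query = guided.mean(dim=1, keepdim=True)              # (B, 1, D)
        # 2) multistep reasoning over visual features
        for _ in range(self.steps):
            step, _ = self.reason_attn(query, vis_tokens, vis_tokens)
            query = query + step
        return self.head(query.squeeze(1)).sigmoid()          # normalized box


if __name__ == "__main__":
    B, C, H, W, L, D_txt = 2, 256, 20, 20, 12, 768
    low, high = LowLevelInteraction(C, D_txt), HighLevelInteraction(C)
    vis = low(torch.randn(B, C, H, W), torch.randn(B, D_txt))
    vis_tokens = vis.flatten(2).transpose(1, 2)               # (B, HW, C)
    print(high(torch.randn(B, L, C), vis_tokens).shape)       # torch.Size([2, 4])
```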
Yue WANG, Jiabo YE, Xin LIN. Dual-path network with multilevel interaction for one-stage visual grounding[J]. Journal of East China Normal University (Natural Science), 2024, 2024(2): 65-75. DOI: 10.3969/j.issn.1000-5641.2024.02.008