1 |
NIU Y, TANG K, ZHANG H, et al. Counterfactual VQA: A cause-effect look at language bias [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 12700-12710.
2 |
YANG Z, GAN Z, WANG J, et al. An empirical study of GPT-3 for few-shot knowledge-based VQA [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2022: 3081-3089.
3 |
TU Z, WANG Y, BIRKBECK N, et al. UGC-VQA: Benchmarking blind video quality assessment for user generated content [J]. IEEE Transactions on Image Processing, 2021, 30: 4449-4464.
4 |
SENEVIRATNE K L, MUNAWEERA I, PEIRIS S E, et al. Recent progress in visible-light active (VLA) TiO2 nano-structures for enhanced photocatalytic activity (PCA) and antibacterial properties: A review [J]. Iranian Journal of Catalysis, 2021, 11(3): 217-245.
5 |
TOBIN J J, OFFNER S S R, KRATTER K M, et al. The VLA/ALMA nascent disk and multiplicity (VANDAM) survey of Orion protostars. V. A characterization of protostellar multiplicity [J]. The Astrophysical Journal, 2022, 925: 39.
6 |
WEI M, CHEN L, JI W, et al. Rethinking the two-stage framework for grounded situation recognition [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2022: 2651-2658.
7 |
QIAO Y, DENG C, WU Q. Referring expression comprehension: A survey of methods and datasets [J]. IEEE Transactions on Multimedia, 2020, 23: 4426-4440.
8 |
YANG S, LI G, YU Y. Dynamic graph attention for referring expression comprehension [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 4644-4653.
9 |
CHEN L, MA W, XIAO J, et al. Ref-NMS: Breaking proposal bottlenecks in two-stage referring expression grounding [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2021: 1036-1044.
10 |
MU Z, TANG S, TAN J, et al. Disentangled motif-aware graph learning for phrase grounding [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2021: 13587-13594.
11 |
YANG Z, GONG B, WANG L, et al. A fast and accurate one-stage approach to visual grounding [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 4683-4693.
12 |
YANG Z, CHEN T, WANG L, et al. Improving one-stage visual grounding by recursive sub-query construction [C]// Computer Vision–ECCV 2020. 2020: 387-404.
13 |
DENG J, YANG Z, CHEN T, et al. TransVG: End-to-end visual grounding with transformers [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 1769-1779.
14 |
KAMATH A, SINGH M, LECUN Y, et al. MDETR - Modulated detection for end-to-end multi-modal understanding [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 1780-1790.
15 |
YE J, LIN X, HE L, et al. One-stage visual grounding via semantic-aware feature filter [C]// Proceedings of the 29th ACM International Conference on Multimedia. 2021: 1702-1711.
16 |
KAZEMZADEH S, ORDONEZ V, MATTEN M, et al. ReferItGame: Referring to objects in photographs of natural scenes [C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 787-798.
17 |
YU L, POIRSON P, YANG S, et al. Modeling context in referring expressions [C]// Computer Vision–ECCV 2016. 2016: 69-85.
18 |
MAO J, HUANG J, TOSHEV A, et al. Generation and comprehension of unambiguous object descriptions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 11-20.
19 |
PLUMMER B A, WANG L, CERVANTES C M, et al. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models [C]// Proceedings of the IEEE International Conference on Computer Vision. 2015: 2641-2649.
20 |
YANG S, LI G, YU Y. Graph-structured referring expression reasoning in the wild [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 9952-9961.
21 |
REDMON J, FARHADI A. YOLOv3: An incremental improvement [EB/OL]. (2018-04-08) [2022-11-20]. https://doi.org/10.48550/arXiv.1804.02767.
22 |
DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [EB/OL]. (2019-05-24) [2022-11-22]. https://doi.org/10.48550/arXiv.1810.04805.
23 |
SHARMA P, DING N, GOODMAN S, et al. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2018: 2556-2565.
24 |
WOO S, PARK J, LEE J Y, et al. CBAM: Convolutional block attention module [C]// Proceedings of the European Conference on Computer Vision. 2018: 3-19.
25 |
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 6000-6010.
26 |
TIAN Z, CHU X, WANG X, et al. Fully convolutional one-stage 3D object detection on LiDAR range images [J]. Advances in Neural Information Processing Systems, 2022, 35: 34899-34911.
27 |
GIRSHICK R. Fast R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision. 2015: 1440-1448.
28 |
REZATOFIGHI H, TSOI N, GWAK J Y, et al. Generalized intersection over union: A metric and a loss for bounding box regression [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 658-666.
29 |
LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context [C]// Computer Vision–ECCV 2014. 2014: 740-755.
30 |
LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization [EB/OL]. (2017-11-14) [2022-11-15]. https://doi.org/10.48550/arXiv.1711.05101.
31 |
REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
32 |
NAGARAJA V K, MORARIU V I, DAVIS L S. Modeling context between objects for referring expression understanding [C]// Computer Vision–ECCV 2016. 2016: 792-807.
33 |
HU R, ROHRBACH M, ANDREAS J, et al. Modeling relationships in referential expressions with compositional modular networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1115-1124.
34 |
ZHUANG B, WU Q, SHEN C, et al. Parallel attention: A unified framework for visual object discovery through dialogs and queries [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4252-4261.
35 |
ZHANG H, NIU Y, CHANG S F. Grounding referring expressions in images by variational context [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4158-4166.
36 |
WANG P, WU Q, CAO J, et al. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 1960-1968.
37 |
YU L, TAN H, BANSAL M, et al. A joint speaker-listener-reinforcer model for referring expressions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 7282-7290.
38 |
CHEN X, MA L, CHEN J, et al. Real-time referring expression comprehension by single-stage grounding network [EB/OL]. (2018-12-09) [2022-11-18]. https://doi.org/10.48550/arXiv.1812.03426.