[1] STEFANINI M, CORNIA M, BARALDI L, et al. From show to tell: A survey on deep learning-based image captioning [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(1): 539-559.
[2] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks [C]// Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 2. 2014: 3104-3112.
[3] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: A neural image caption generator [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 3156-3164.
[4] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6077-6086.
[5] LI G, ZHU L, LIU P, et al. Entangled transformer for image captioning [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 8928-8937.
[6] HERDADE S, KAPPELER A, BOAKYE K, et al. Image captioning: Transforming objects into words [J]. Advances in Neural Information Processing Systems, 2019: 14733-14745.
[7] XU K, BA J, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention [C]// International Conference on Machine Learning. 2015: 2048-2057.
[8] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 6000-6010.
[9] HUANG L, WANG W, CHEN J, et al. Attention on attention for image captioning [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 4634-4643.
[10] CORNIA M, STEFANINI M, BARALDI L, et al. Meshed-memory transformer for image captioning [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 10578-10587.
[11] PAN Y W, YAO T, LI Y H, et al. X-Linear attention networks for image captioning [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 10971-10980.
[12] ZHANG X, SUN X, LUO Y, et al. RSTNet: Captioning with adaptive attention on visual and non-visual words [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 15465-15474.
[13] ZENG P P, ZHANG H N, SONG J K, et al. S² transformer for image captioning [C]// Proceedings of the International Joint Conference on Artificial Intelligence. 2022: 12675-12684.
[14] YANG X, TANG K, ZHANG H, et al. Autoencoding scene graphs for image captioning [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 10685-10694.
[15] LUO Y, JI J, SUN X, et al. Dual-level collaborative transformer for image captioning [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2021: 2286-2293.
[16] LU J, XIONG C, PARIKH D, et al. Knowing when to look: Adaptive attention via a visual sentinel for image captioning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 375-383.
[17] HU R H, SINGH A, DARRELL T, et al. Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 9992-10002.
[18] GAO J, MENG X, WANG S, et al. Masked non-autoregressive image captioning [EB/OL]. (2019-06-03)[2023-01-03]. https://arxiv.org/abs/1906.00717.
[19] FEI Z. Iterative back modification for faster image captioning [C]// Proceedings of the 28th ACM International Conference on Multimedia. 2020: 3182-3190.
[20] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 7008-7024.
[21] FENG Y, MA L, LIU W, et al. Unsupervised image captioning [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 4125-4134.
[22] GU J, JOTY S, CAI J, et al. Unpaired image captioning via scene graph alignments [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 10323-10332.
[23] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision [C]// Proceedings of the 38th International Conference on Machine Learning. 2021: 8748-8763.
[24] JOHNSON J, KARPATHY A, LI F F. DenseCap: Fully convolutional localization networks for dense captioning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4565-4574.
[25] YANG L, TANG K, YANG J, et al. Dense captioning with joint inference and visual context [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2193-2202.
[26] LI X, JIANG S, HAN J. Learning object context for dense captioning [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2019: 8650-8657.
[27] CORNIA M, BARALDI L, CUCCHIARA R. Show, control and tell: A framework for generating controllable and grounded captions [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 8307-8316.
[28] ZHENG Y, LI Y, WANG S. Intention oriented image captions with guiding objects [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 8395-8404.
[29] CHEN S, JIN Q, WANG P, et al. Say as you wish: Fine-grained control of image caption generation with abstract scene graphs [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 9962-9971.
[30] CHEN L, JIANG Z, XIAO J, et al. Human-like controllable image captioning with verb-specific semantic roles [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 16846-16856.
[31] LI X R, XU C X, WANG X X, et al. COCO-CN for cross-lingual image tagging, captioning, and retrieval [J]. IEEE Transactions on Multimedia, 2019, 21(9): 2347-2360.
[32] LAN W, LI X, DONG J. Fluency-guided cross-lingual image captioning [C]// Proceedings of the 25th ACM International Conference on Multimedia. 2017: 1549-1557.
[33] SONG Y, CHEN S, ZHAO Y, et al. Unpaired cross-lingual image caption generation with self-supervised rewards [C]// Proceedings of the 27th ACM International Conference on Multimedia. 2019: 784-792.
[34] GIRSHICK R. Fast R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision. 2015: 1440-1448.
[35] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context [C]// Computer Vision-ECCV 2014. 2014: 740-755.
[36] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 3128-3137.
[37] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: A method for automatic evaluation of machine translation [C]// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002: 311-318.
[38] BANERJEE S, LAVIE A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments [C]// Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005: 65-72.
[39] LIN C Y. ROUGE: A package for automatic evaluation of summaries [C]// Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. 2004: 74-81.
[40] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: Consensus-based image description evaluation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4566-4575.
[41] ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: Semantic propositional image caption evaluation [C]// Computer Vision-ECCV 2016. 2016: 382-398.