Journal of East China Normal University (Natural Science) ›› 2020, Vol. 2020 ›› Issue (5): 56-67. DOI: 10.3969/j.issn.1000-5641.202091004

• Methodology and System of Machine Learning •

Enriching image descriptions by fusing fine-grained semantic features with a transformer

WANG Junhao, LUO Yifeng   

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  Received: 2020-08-04; Published: 2020-09-24

Abstract: Modern image captioning models that follow the encoder-decoder architecture of a convolutional neural network (CNN) and a recurrent neural network (RNN) tend to discard much of the detailed information contained in images and incur high training costs. In this paper, we propose a novel model, consisting of a compact bilinear encoder and a compact multi-modal decoder, to improve image captioning with fine-grained regional object features. In the encoder, compact bilinear pooling (CBP) encodes fine-grained semantic features from an image's regional features, while transformers encode global semantic features from the image's global bottom-up features; the two sets of encoded features are then fused through a gate structure to form the overall encoded features of the image. In the decoding process, we extract multi-modal features from fine-grained regional object features and fuse them with the overall encoded features to decode semantic information for description generation. Extensive experiments on the public Microsoft COCO dataset show that our model achieves state-of-the-art image captioning performance.
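The two encoder components named in the abstract, compact bilinear pooling and the gate-based feature fusion, can be sketched as follows. This is a minimal illustration assuming the common Tensor Sketch formulation of CBP (random count sketches combined via FFT convolution) and a simple sigmoid gate; all function names, dimensions, and parameters here are illustrative assumptions, not the authors' published implementation.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project vector x into d dimensions using hash indices h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # scatter-add signed entries into hashed buckets
    return y

def compact_bilinear(x1, x2, d=512, seed=0):
    """Tensor Sketch approximation of the bilinear (outer-product) pooling
    of feature vectors x1 and x2, compressed to d dimensions."""
    rng = np.random.default_rng(seed)
    h1 = rng.integers(0, d, x1.shape[0]); s1 = rng.choice([-1.0, 1.0], x1.shape[0])
    h2 = rng.integers(0, d, x2.shape[0]); s2 = rng.choice([-1.0, 1.0], x2.shape[0])
    # Convolution of the two count sketches == element-wise product in FFT domain
    f1 = np.fft.rfft(count_sketch(x1, h1, s1, d))
    f2 = np.fft.rfft(count_sketch(x2, h2, s2, d))
    return np.fft.irfft(f1 * f2, n=d)

def gated_fusion(f_fine, f_global, W, b):
    """Fuse fine-grained and global encoded features through a sigmoid gate:
    the gate decides, per dimension, how much of each feature stream to keep."""
    z = W @ np.concatenate([f_fine, f_global]) + b
    g = 1.0 / (1.0 + np.exp(-z))           # gate values in (0, 1)
    return g * f_fine + (1.0 - g) * f_global
```

In this sketch, `compact_bilinear` would encode fine-grained pairwise feature interactions at a fraction of the cost of an explicit outer product, and `gated_fusion` would combine them with the transformer-encoded global features; the gate parameters `W` and `b` would be learned jointly with the rest of the model.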

Key words: image captioning, fine-grained features, multi-modal features, transformer
