Journal of East China Normal University (Natural Science) ›› 2020, Vol. 2020 ›› Issue (5): 56-67. DOI: 10.3969/j.issn.1000-5641.202091004

• Methodology and System of Machine Learning •

Enriching image descriptions by fusing fine-grained semantic features with a transformer

WANG Junhao, LUO Yifeng   

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  Received: 2020-08-04; Published: 2020-09-24

Abstract: Modern image captioning models that follow the encoder-decoder architecture of a convolutional neural network (CNN) and a recurrent neural network (RNN) tend to discard much of the detailed information contained in images and incur high training costs. In this paper, we propose a novel model, consisting of a compact bilinear encoder and a compact multi-modal decoder, to improve image captioning with fine-grained regional object features. In the encoder, compact bilinear pooling (CBP) encodes fine-grained semantic features from an image's regional features, while transformers encode global semantic features from the image's global bottom-up features; the two sets of encoded features are then fused through a gate structure to form the overall encoded features of the image. In the decoding process, we extract multi-modal features from fine-grained regional object features and fuse them with the overall encoded features to decode semantic information for description generation. Extensive experiments on the public Microsoft COCO dataset show that our model achieves state-of-the-art image captioning performance.
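The two encoder components named in the abstract, compact bilinear pooling and the gate-based feature fusion, can be sketched as follows. This is a minimal illustration assuming the common Tensor Sketch formulation of CBP (random count sketches combined via FFT convolution) and a simple sigmoid gate; all function names, dimensions, and parameters here are illustrative assumptions, not the authors' published implementation.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project vector x into d dimensions using hash indices h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # scatter-add signed entries into hashed buckets
    return y

def compact_bilinear(x1, x2, d=512, seed=0):
    """Tensor Sketch approximation of the bilinear (outer-product) pooling
    of feature vectors x1 and x2, compressed to d dimensions."""
    rng = np.random.default_rng(seed)
    h1 = rng.integers(0, d, x1.shape[0]); s1 = rng.choice([-1.0, 1.0], x1.shape[0])
    h2 = rng.integers(0, d, x2.shape[0]); s2 = rng.choice([-1.0, 1.0], x2.shape[0])
    # Convolution of the two count sketches == element-wise product in FFT domain
    f1 = np.fft.rfft(count_sketch(x1, h1, s1, d))
    f2 = np.fft.rfft(count_sketch(x2, h2, s2, d))
    return np.fft.irfft(f1 * f2, n=d)

def gated_fusion(f_fine, f_global, W, b):
    """Fuse fine-grained and global encoded features through a sigmoid gate:
    the gate decides, per dimension, how much of each feature stream to keep."""
    z = W @ np.concatenate([f_fine, f_global]) + b
    g = 1.0 / (1.0 + np.exp(-z))           # gate values in (0, 1)
    return g * f_fine + (1.0 - g) * f_global
```

In this sketch, `compact_bilinear` would encode fine-grained pairwise feature interactions at a fraction of the cost of an explicit outer product, and `gated_fusion` would combine them with the transformer-encoded global features; the gate parameters `W` and `b` would be learned jointly with the rest of the model.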

Key words: image captioning, fine-grained features, multi-modal features, transformer
