Journal of East China Normal University (Natural Science) ›› 2020, Vol. 2020 ›› Issue (5): 56-67. doi: 10.3969/j.issn.1000-5641.202091004

• Machine Learning Methods and Systems •

Enriching image descriptions by fusing fine-grained semantic features with a transformer

WANG Junhao, LUO Yifeng

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  • Received: 2020-08-04; Published: 2020-09-24
  • Corresponding author: LUO Yifeng, male, associate professor, master's supervisor; research interests: text data mining and knowledge graphs. E-mail: yifluo@dase.ecnu.edu.cn
  • Funding: National Key R&D Program of China (2018YFC0831904)

Abstract: Conventional image captioning models, typically built on an encoder-decoder architecture with a convolutional neural network (CNN) encoder and a recurrent neural network (RNN) decoder, face two issues: they discard a large amount of the detailed information contained in images, and they are costly to train. In this paper, we propose a novel model, consisting of a compact bilinear encoder and a compact multi-modal decoder, to improve image captioning with fine-grained regional object features. In the encoder, compact bilinear pooling (CBP) encodes fine-grained semantic features from an image's regional features, while multi-layer transformers encode global semantic features from the image's global bottom-up features; the encoded features are then fused through a gate structure to form the overall encoded representation of the image. In the decoding process, we extract multi-modal features from the fine-grained regional object features and the object category features, and fuse them with the overall encoded features to decode semantic information for caption generation. Extensive experiments performed on the public Microsoft COCO dataset show that our model achieves better image captioning performance than existing models.
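The abstract names two concrete mechanisms: compact bilinear pooling (CBP) over regional features, and a gate structure that fuses two encoded feature vectors. As a rough illustration only, here is a minimal NumPy sketch of both. The CBP part follows the standard Count Sketch/FFT formulation from the CBP literature; the sigmoid gate in `gate_fuse` (with weight `W` and `bias`) is an assumed formulation, since this page does not give the paper's exact equations.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sketch_params(input_dim, sketch_dim, rng):
    """Fixed random hash indices and signs for a Count Sketch projection."""
    h = rng.integers(0, sketch_dim, size=input_dim)  # target bucket per input dim
    s = rng.choice([-1.0, 1.0], size=input_dim)      # random sign per input dim
    return h, s

def count_sketch(x, h, s, sketch_dim):
    """Project x into sketch_dim buckets with random signs."""
    y = np.zeros(sketch_dim)
    np.add.at(y, h, s * x)  # accumulates correctly over repeated bucket indices
    return y

def compact_bilinear_pooling(x1, x2, params1, params2, sketch_dim):
    """Approximate the outer product of x1 and x2 via circular convolution
    of the two count sketches, computed in the frequency domain."""
    y1 = count_sketch(x1, *params1, sketch_dim)
    y2 = count_sketch(x2, *params2, sketch_dim)
    return np.fft.irfft(np.fft.rfft(y1) * np.fft.rfft(y2), n=sketch_dim)

def gate_fuse(a, b, W, bias):
    """Assumed gate structure: a sigmoid gate computed from both inputs
    mixes the two encoded feature vectors elementwise."""
    g = 1.0 / (1.0 + np.exp(-(W @ np.concatenate([a, b]) + bias)))
    return g * a + (1.0 - g) * b
```

The FFT trick works because the count sketch of an outer product equals the circular convolution of the individual sketches, so the quadratic-size bilinear feature never needs to be materialized.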

Key words: image captioning, fine-grained features, multi-modal features, transformer
