Journal of East China Normal University (Natural Science) ›› 2021, Vol. 2021 ›› Issue (5): 14-23. doi: 10.3969/j.issn.1000-5641.2021.05.002

• Financial Knowledge Graph •

Data augmentation technology for named entity recognition

Xiaoqin MA1, Xiaohe GUO1, Yufeng XUE1, Lin YANG2,*, Yuanzhe CHEN3

  1. Information and Communication Company, State Grid Qinghai Electric Power Company, Xining 810008, China
  2. Shanghai Development Center of Computer Software Technology, Shanghai 201112, China
  3. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  • Received: 2021-08-24 Online: 2021-09-25 Published: 2021-09-28
  • Corresponding author: Lin YANG E-mail: xqm8651@126.com; yangl@sscenter.sh.cn
  • About the author: Xiaoqin MA, female, senior engineer; research interest: maintenance and repair of electricity consumption information systems. E-mail: xqm8651@126.com
  • Funding:
    National Natural Science Foundation of China (U1911203, U1811264, 61877018, 61672234, 61672384); Fundamental Research Funds for the Central Universities; Shanghai Key Laboratory of Pure Mathematics and Mathematical Practice (18dz2271000)

Abstract:

Named entity recognition is the task of extracting instances of named entities from continuous natural language text. It plays an important role in information extraction and is closely related to other information extraction tasks. In recent years, deep learning methods have been widely applied to named entity recognition and have achieved good results. Most mainstream named entity recognition models are based on sequence tagging, which depends on a sufficient amount of high-quality annotated corpora. However, annotating sequence data is expensive, so training sets for named entity recognition tend to be small, which seriously limits the final performance of the models. To enlarge the training set without increasing manual labor cost, this paper proposes three data augmentation techniques for named entity recognition, based respectively on EDA (Easy Data Augmentation), distant supervision, and Bootstrap. Experiments on the FIND-2019 dataset presented in this paper show that these techniques, and combinations of them, can enlarge the training set at low cost and thereby significantly improve the performance of named entity recognition models.

Key words: named entity recognition, data augmentation, EDA, distant supervision, Bootstrap
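
To make the idea of label-preserving augmentation for sequence tagging concrete, the following is a minimal sketch of one EDA-style operation: swapping each annotated entity for another mention of the same type. The abstract does not specify the authors' exact operations, so this is an illustrative assumption rather than the paper's implementation. The function names `extract_spans` and `swap_entities`, the BIO tag scheme, and the `lexicon` structure are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Mention = List[str]  # one entity mention as a list of tokens


def extract_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) for each BIO entity span; end is exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def swap_entities(
    tokens: List[str],
    tags: List[str],
    lexicon: Dict[str, List[Mention]],  # hypothetical: entity type -> known mentions
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Build a new labeled sentence by replacing entities with same-type mentions."""
    new_tokens, new_tags = [], []
    cursor = 0
    for start, end, etype in extract_spans(tags):
        # Copy the non-entity context unchanged.
        new_tokens.extend(tokens[cursor:start])
        new_tags.extend(tags[cursor:start])
        # Pick a replacement; keep the original mention if the type is unknown.
        candidates = lexicon.get(etype)
        mention = rng.choice(candidates) if candidates else tokens[start:end]
        new_tokens.extend(mention)
        # Re-emit BIO tags so the labels stay aligned with the new tokens.
        new_tags.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(mention) - 1))
        cursor = end
    new_tokens.extend(tokens[cursor:])
    new_tags.extend(tags[cursor:])
    return new_tokens, new_tags
```

Because the replacement mention may differ in length from the original, the tags are regenerated rather than copied, which keeps every token paired with a valid label. In a setting like the one the abstract describes, the lexicon could plausibly be filled from entities already present in the training set or from an external knowledge base, though that choice is also an assumption here.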

CLC number: