Data augmentation technology for named entity recognition
Received date: 2021-08-24
Online published: 2021-09-28
Funding: National Natural Science Foundation of China (U1911203, U1811264, 61877018, 61672234, 61672384); the Fundamental Research Funds for the Central Universities; Shanghai Key Laboratory of Pure Mathematics and Mathematical Practice (18dz2271000)
MA Xiaoqin, GUO Xiaohe, XUE Yufeng, YANG Lin, CHEN Yuanzhe. Data augmentation technology for named entity recognition [J]. Journal of East China Normal University (Natural Science), 2021, 2021(5): 14-23. DOI: 10.3969/j.issn.1000-5641.2021.05.002
Named entity recognition is the task of extracting instances of named entities from continuous natural language text. It plays an important role in information extraction and is closely related to other information extraction tasks. In recent years, deep learning methods have been widely applied to named entity recognition and have achieved good performance. Mainstream named entity recognition models, however, are based on sequence tagging, which relies on a sufficiently large, high-quality annotated corpus. Because annotating sequence data is expensive, training sets for named entity recognition tend to be small, which severely limits the final performance of the resulting models. To enlarge the training set without additional labor cost, this paper proposes data augmentation techniques for named entity recognition based on EDA (Easy Data Augmentation), distant supervision, and bootstrapping. Experiments on the FIND-2019 dataset introduced in this paper show that these techniques and their combinations can cheaply enlarge the training set and thereby significantly improve the performance of named entity recognition models.
Key words: named entity recognition; data augmentation; EDA; distant supervision; Bootstrap
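
Beyond the names of the three strategies, this excerpt gives no implementation details, so the Python sketch below is only a minimal illustration of what EDA-style mention replacement, distant-supervision labeling, and bootstrap selection might look like on BIO-tagged data. It is not the authors' code: ENTITY_DICT, mention_replace, distant_label, bootstrap_select, and every value inside them are hypothetical stand-ins.

```python
import random
from typing import Dict, List, Tuple

# A sentence is a list of (token, BIO-tag) pairs, e.g.
# [("张", "B-PER"), ("三", "I-PER"), ("说", "O")]
Sentence = List[Tuple[str, str]]

# Hypothetical type-keyed entity dictionary shared by the EDA-style
# replacement and the distant-supervision labeller.
ENTITY_DICT: Dict[str, List[List[str]]] = {
    "PER": [["李", "四"], ["王", "五"]],
    "ORG": [["华", "东", "师", "范", "大", "学"]],
}

def mention_replace(sent: Sentence, p: float = 0.5) -> Sentence:
    """EDA-style augmentation: with probability p, swap each entity
    mention for another dictionary entry of the same type, rewriting
    the BIO tags to fit the new mention length."""
    out: Sentence = []
    i = 0
    while i < len(sent):
        token, tag = sent[i]
        if tag.startswith("B-"):
            etype = tag[2:]
            j = i + 1
            while j < len(sent) and sent[j][1] == f"I-{etype}":
                j += 1
            if etype in ENTITY_DICT and random.random() < p:
                new = random.choice(ENTITY_DICT[etype])
                tags = [f"B-{etype}"] + [f"I-{etype}"] * (len(new) - 1)
                out.extend(zip(new, tags))
            else:
                out.extend(sent[i:j])
            i = j
        else:
            out.append((token, tag))
            i += 1
    return out

def distant_label(tokens: List[str]) -> Sentence:
    """Distant supervision: project dictionary entries onto unlabeled
    text by exact string match, producing silver BIO tags."""
    tags = ["O"] * len(tokens)
    for etype, mentions in ENTITY_DICT.items():
        for m in mentions:
            for i in range(len(tokens) - len(m) + 1):
                span = tags[i : i + len(m)]
                if tokens[i : i + len(m)] == m and all(t == "O" for t in span):
                    tags[i] = f"B-{etype}"
                    for k in range(i + 1, i + len(m)):
                        tags[k] = f"I-{etype}"
    return list(zip(tokens, tags))

def bootstrap_select(predictions: List[Tuple[Sentence, float]],
                     threshold: float = 0.95) -> List[Sentence]:
    """Bootstrapping (self-training): keep only sentences whose
    predicted tag sequence the current model scores above `threshold`;
    these are added to the training set for the next round."""
    return [sent for sent, conf in predictions if conf >= threshold]
```

In a pipeline of the kind the abstract describes, sentences produced by these three routines would be pooled with the gold training set (typically after deduplication) before retraining the tagger; the abstract reports that the strategies also help when used in combination.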