Journal of East China Normal University (Natural Science) ›› 2022, Vol. 2022 ›› Issue (6): 79-86. doi: 10.3969/j.issn.1000-5641.2022.06.009

• Computer Science •

Distant supervision relation extraction via the influence function

Ziyin HUANG, Yuanbin WU*

  1. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
  • Received: 2021-08-13 Online: 2022-11-25 Published: 2022-11-22
  • Contact: Yuanbin WU, E-mail: ybwu@cs.ecnu.edu.cn

Abstract:

Distant supervision is widely used to label data for relation extraction. It reduces the burden of manual annotation but also introduces a large number of noisy instances, which hinder model training. To alleviate this problem, we propose a denoising method based on the influence function. The influence function measures the effect of each training point on the model's predictions; the influence of a training point is defined as the change in test loss after that point is removed from the training set. We observe that this quantity can be used to judge whether a training instance is mislabeled. First, we design a scoring function based on the influence function. Then, we integrate the scoring function into a bootstrapping framework that starts from a small clean set and iteratively builds the final denoised dataset. Because the method operates as a data preprocessing step, it can in principle be applied to any distantly supervised dataset. Experimental results show that the method achieves good performance on a public dataset.
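
For readers unfamiliar with the term, the influence of a training point on test loss is usually written in the first-order form popularised by Koh and Liang (2017). The abstract does not give the authors' exact expression, so the notation below is an assumption: $\hat\theta$ is the trained parameter vector, $L$ the per-example loss, $n$ the training-set size, and $\hat\theta_{-z}$ the parameters obtained after removing $z$ and retraining.

```latex
\[
\mathcal{I}(z, z_{\mathrm{test}})
  = -\,\nabla_\theta L(z_{\mathrm{test}}, \hat\theta)^{\top}
     H_{\hat\theta}^{-1}\,
     \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
\]
\[
L(z_{\mathrm{test}}, \hat\theta_{-z}) - L(z_{\mathrm{test}}, \hat\theta)
  \;\approx\; -\frac{1}{n}\,\mathcal{I}(z, z_{\mathrm{test}})
  \;=\; \frac{1}{n}\,\nabla_\theta L(z_{\mathrm{test}}, \hat\theta)^{\top}
        H_{\hat\theta}^{-1}\,\nabla_\theta L(z, \hat\theta)
\]
```

Under this approximation, a negative value of the second expression means that removing $z$ is expected to lower the test loss, which is the kind of signal a mislabeled instance would tend to produce.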

Key words: distant supervision, relation extraction, influence function, bootstrapping
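
The pipeline summarised in the abstract (influence-based scoring inside a bootstrapping loop that starts from a small clean set) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: a regularised logistic regression stands in for the relation extraction model, the data are synthetic, the score is simply the first-order estimate of the change in loss on the trusted set when a point is removed, and the names `fit_logreg`, `removal_effect`, `bootstrap_denoise`, `rounds` and `k` are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logreg(X, y, l2=0.1, steps=25):
    """Stand-in model (assumption): L2-regularised logistic regression, Newton's method."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n + l2 * w
        hess = (X.T * (p * (1.0 - p))) @ X / n + l2 * np.eye(d)
        w = w - np.linalg.solve(hess, grad)
    return w

def per_example_grads(w, X, y):
    # Gradient of the logistic loss for each example, shape (n, d).
    return (sigmoid(X @ w) - y)[:, None] * X

def removal_effect(w, X_train, X_cand, y_cand, X_ref, y_ref, l2=0.1):
    """First-order estimate of the change in mean reference loss if each
    candidate were removed from training: (1/n) * g_ref^T H^{-1} g_cand.
    Negative values suggest the candidate hurts the reference set."""
    p = sigmoid(X_train @ w)
    H = (X_train.T * (p * (1.0 - p))) @ X_train / len(X_train)
    H = H + l2 * np.eye(X_train.shape[1])
    g_ref = per_example_grads(w, X_ref, y_ref).mean(axis=0)
    g_cand = per_example_grads(w, X_cand, y_cand)
    return g_cand @ np.linalg.solve(H, g_ref) / len(X_train)

def bootstrap_denoise(X, y, seed_idx, rounds=6, k=20, l2=0.1):
    """Hypothetical bootstrapping loop: grow a trusted set from a clean seed."""
    trusted = list(seed_idx)
    seed = set(trusted)
    pool = [i for i in range(len(y)) if i not in seed]
    for _ in range(rounds):
        if not pool:
            break
        active = np.array(trusted + pool)
        w = fit_logreg(X[active], y[active], l2)
        delta = removal_effect(w, X[active], X[pool], y[pool],
                               X[trusted], y[trusted], l2)
        order = np.argsort(delta)  # ascending: most harmful candidates first
        promote = [pool[j] for j in order[::-1][:k] if delta[j] > 0]
        discard = [pool[j] for j in order[:k] if delta[j] < 0]
        trusted += promote
        gone = set(promote) | set(discard)
        pool = [i for i in pool if i not in gone]
    return np.array(trusted)

# Synthetic demo (assumption): two Gaussian clusters, 40 of 200 pool labels flipped.
rng = np.random.default_rng(0)
n = 220
y_true = rng.integers(0, 2, size=n)
X = (2 * y_true - 1)[:, None] * np.array([2.0, 2.0]) + rng.normal(size=(n, 2))
y = y_true.astype(float)
y[20:60] = 1 - y[20:60]          # indices 0..19 form the clean seed set
kept = bootstrap_denoise(X, y, seed_idx=range(20))
kept_noise_rate = np.mean(y[kept[20:]] != y_true[kept[20:]])
print(len(kept), kept_noise_rate)
```

In this toy setting the loop promotes pool points whose removal is estimated to raise the loss on the trusted set and discards those whose removal is estimated to lower it. A real system would need a scalable inverse-Hessian-vector product in place of the exact solve used here.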

CLC number: