Journal of East China Normal University (Natural Science) ›› 2020, Vol. 2020 ›› Issue (5): 113-130. doi: 10.3969/j.issn.1000-5641.202091006

• Data Semantic Extraction •

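Relation extraction via distant supervision technology

WANG Jianing1, HE Yi2, ZHU Renyu1, LIU Tingting1, GAO Ming1

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China;
  2. Shanghai Municipal Big Data Center, Shanghai 200072, China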
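  • Received: 2020-08-07 Published: 2020-09-24
  • Corresponding author: HE Yi, female, engineer; research interests include data operations, data analysis, user profiling, and social network mining. E-mail: yhe01@shanghai.gov.cn
  • Funding:
    National Key Research and Development Program of China (2016YFB1000905); National Natural Science Foundation of China (U1911203, U1811264, 61877018, 61672234, 61672384); Fundamental Research Funds for the Central Universities; Shanghai Agriculture Applied Technology Development Program (T20170303); Shanghai Key Laboratory of Pure Mathematics and Mathematical Practice (18dz2271000)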

Abstract: Relation extraction is a classic natural language processing task that aims to extract the semantic relation between a target entity pair. It is widely used in knowledge graph construction and completion, knowledge base question answering, and text summarization. To construct large-scale supervised corpora efficiently, distant supervision was proposed, which annotates text automatically by aligning it with an existing knowledge base. However, the overly strong assumption behind this alignment introduces a series of challenges, which have attracted considerable attention from researchers. This paper first introduces the concept and formal description of distantly supervised relation extraction. It then compares related methods and their strengths and weaknesses from three perspectives: noisy data, insufficient information, and data imbalance. Next, it explains and compares benchmark datasets and evaluation metrics. Finally, it discusses new challenges and future trends for distantly supervised relation extraction before concluding.

Key words: relation extraction, distant supervision, natural language processing, knowledge graph, noise processing
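
The automatic annotation described in the abstract can be illustrated with a minimal sketch. It assumes the usual distant supervision heuristic: if a knowledge base records a relation between two entities, every sentence mentioning both entities is labeled with that relation. The toy knowledge base, the sentences, and the function name `distant_label` are hypothetical and are not taken from the paper; entity matching is reduced to plain substring search.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (head entity, tail entity) -> relation.
KB: Dict[Tuple[str, str], str] = {
    ("Barack Obama", "Honolulu"): "born_in",
    ("Apple", "Steve Jobs"): "founded_by",
}


def distant_label(
    sentences: List[str], kb: Dict[Tuple[str, str], str]
) -> List[Tuple[str, str, str, str]]:
    """Label each sentence that mentions both entities of a KB fact.

    Assumption (the distant supervision heuristic): a sentence containing
    both entities is taken to express the KB relation between them.
    """
    labeled = []
    for sentence in sentences:
        for (head, tail), relation in kb.items():
            # Naive alignment by substring match; real systems would use
            # entity linking instead.
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, tail, relation))
    return labeled


sentences = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Barack Obama gave a speech in Honolulu last week.",
    "Steve Jobs introduced the iPhone in 2007.",
]

labeled = distant_label(sentences, KB)
for sentence, head, tail, relation in labeled:
    print(f"{relation}({head}, {tail}) <- {sentence}")
```

In this sketch the second sentence also receives the `born_in` label although it does not express that relation, which is the kind of noisy annotation the overly strong assumption produces. The third sentence receives no label because only one entity of any knowledge base fact appears in it.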

CLC Number: