Journal of East China Normal University (Natural Science) ›› 2021, Vol. 2021 ›› Issue (6): 147-160. doi: 10.3969/j.issn.1000-5641.2021.06.015

• Computer Science •

Unsupervised author name disambiguation based on heterogeneous networks

Chenliang GUO1, Xin LIN1,*, Yue YIN2

  1. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
  2. Shanghai Technology Development Co., Ltd., Shanghai 200031, China
  • Received: 2020-09-18 Online: 2021-11-25 Published: 2021-11-26
  • Contact: Xin LIN E-mail: xlin@cs.ecnu.edu.cn
  • Funding:
    Shanghai Artificial Intelligence Innovation and Development Special Fund Project (2019-RGZN-01086)

Abstract:

Author name disambiguation is an important step in constructing an academic knowledge graph. Ambiguous author names are common in academic literature because of missing data, identical names, and abbreviated names. To address the insufficient use of available information and the cold-start problem, this paper proposes an unsupervised author name disambiguation method based on heterogeneous networks that automatically learns the features of papers written by the same author. The method first preprocesses the author, organization, title, and keyword strings by lemmatization. It then learns embedded representations of text features with word2vec and TF-IDF (Term Frequency–Inverse Document Frequency), and embedded representations of structural features with meta-path random walks and word2vec. After fusing the text and structural feature similarities, it completes disambiguation by clustering with DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and then merging isolated papers. Experimental results show that the model outperforms existing models on a small dataset and in an engineering application for cold-start unsupervised author name disambiguation, indicating that it is effective and can be applied in practice.

Key words: author disambiguation, academic knowledge graph, heterogeneous network, meta-path random walk
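
To make the structural-feature step of the abstract more concrete, the following is a minimal sketch of a meta-path random walk over a small paper network. It is not the authors' implementation: the toy records in `PAPERS`, the relation names `author` and `org`, and the function names `build_network`, `meta_path_walk`, and `generate_walks` are all hypothetical, and recording only paper IDs in each walk is an assumption made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy records for one ambiguous name: paper id -> coauthors and organization.
PAPERS = {
    "p1": {"authors": ["a_li", "a_wang"], "org": "o_ecnu"},
    "p2": {"authors": ["a_li"], "org": "o_ecnu"},
    "p3": {"authors": ["a_zhao"], "org": "o_other"},
}


def build_network(papers):
    """Index the heterogeneous network in both directions.

    forward:  paper -> {relation: set of attribute nodes}
    backward: (relation, attribute node) -> set of papers sharing that node
    """
    forward = {}
    backward = defaultdict(set)
    for pid, rec in papers.items():
        forward[pid] = {"author": set(rec["authors"]), "org": {rec["org"]}}
        for rel, nodes in forward[pid].items():
            for node in nodes:
                backward[(rel, node)].add(pid)
    return forward, backward


def meta_path_walk(start, forward, backward, relations, length, rng):
    """Follow paper -> shared node -> paper hops, recording only paper ids (assumption)."""
    walk = [start]
    current = start
    while len(walk) < length:
        # Keep only attribute nodes that lead to at least one other paper.
        hops = sorted(
            (rel, node)
            for rel in relations
            for node in forward[current][rel]
            if len(backward[(rel, node)]) > 1
        )
        if not hops:
            break  # isolated paper: no shared coauthor or organization
        hop = rng.choice(hops)
        current = rng.choice(sorted(backward[hop] - {current}))
        walk.append(current)
    return walk


def generate_walks(papers, relations=("author", "org"), num_walks=2, length=5, seed=0):
    """Start num_walks walks from every paper, using a seeded generator."""
    rng = random.Random(seed)
    forward, backward = build_network(papers)
    return [
        meta_path_walk(pid, forward, backward, relations, length, rng)
        for _ in range(num_walks)
        for pid in sorted(papers)
    ]
```

In a pipeline of the kind the abstract describes, the walks returned by `generate_walks(PAPERS)` would be treated as sentences and passed to a word2vec implementation to obtain one structural vector per paper. A paper such as `p3`, which shares no coauthor or organization with any other paper, yields a walk of length one; this is the kind of isolated paper that a later merging step would have to handle.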

CLC number: