Unsupervised author name disambiguation based on heterogeneous networks

Chenliang GUO; Xin LIN; Yue YIN

doi:10.3969/j.issn.1000-5641.2021.06.015

Journal of East China Normal University (Natural Science)

2021, Vol. 2021, Issue 6: 147-160

Computer Science

1. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
2. Shanghai Technology Development Co., Ltd., Shanghai 200031, China

Received date: 2020-09-18

Online published: 2021-11-26

Funding

Shanghai Artificial Intelligence Innovation and Development Special Fund Project (2019-RGZN-01086)

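Cite this article

GUO Chenliang, LIN Xin, YIN Yue. Unsupervised author name disambiguation based on heterogeneous networks[J]. Journal of East China Normal University (Natural Science), 2021, 2021(6): 147-160. DOI: 10.3969/j.issn.1000-5641.2021.06.015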

Abstract

Author name disambiguation is an important step in constructing an academic knowledge graph. Name ambiguity is widespread in academic literature because of missing data, identical names shared by different people, and abbreviated names. To address inadequate use of available information and the cold-start problem, this paper proposes an unsupervised author name disambiguation method based on heterogeneous networks that automatically learns the features of papers written by the same author. The method first preprocesses the author, organization, title, and keyword strings by lemmatization. It then learns embedded representations of text features using word2vec and TF-IDF (Term Frequency-Inverse Document Frequency), and embedded representations of structural features using meta-path random walks and word2vec. After the text and structural feature similarities are fused, disambiguation is completed by the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm, followed by a step that merges isolated papers. Experimental results show that the proposed model outperforms existing models on a small dataset and in an engineering application for cold-start unsupervised author name disambiguation, indicating that the model is effective and can be applied in practice.
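The structural step named in the abstract, a meta-path random walk over a heterogeneous network, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the relation names `coauthor` and `org`, the toy records, the meta-path order, and the helper names `build_index` and `metapath_walk`. The resulting walks would then be passed as "sentences" to a skip-gram model such as word2vec, which this sketch does not include.

```python
import random
from collections import defaultdict

# Toy records; all identifiers below are hypothetical illustration data.
papers = {
    "p1": {"coauthor": ["a_wang"], "org": ["ecnu"]},
    "p2": {"coauthor": ["a_wang"], "org": ["sjtu"]},
    "p3": {"coauthor": ["b_li"], "org": ["sjtu"]},
    "p4": {"coauthor": ["c_zhao"], "org": ["mit"]},
}


def build_index(papers):
    """Map each (relation, value) pair to the set of papers that carry it."""
    index = defaultdict(set)
    for pid, attrs in papers.items():
        for relation, values in attrs.items():
            for value in values:
                index[(relation, value)].add(pid)
    return index


def metapath_walk(papers, index, start, metapath, rng):
    """Walk paper -> shared attribute -> paper, one hop per relation in `metapath`.

    The walk stops early when the current paper shares no value of the
    required relation with any other paper.
    """
    walk = [start]
    current = start
    for relation in metapath:
        values = sorted(papers[current].get(relation, ()))
        rng.shuffle(values)  # try the paper's attribute values in random order
        next_paper = None
        for value in values:
            neighbours = sorted(index[(relation, value)] - {current})
            if neighbours:
                next_paper = rng.choice(neighbours)
                break
        if next_paper is None:
            break
        walk.append(next_paper)
        current = next_paper
    return walk


index = build_index(papers)
# Assumed meta-path: alternate between shared co-author and shared organization.
metapath = ["coauthor", "org", "coauthor", "org"]
walks = [
    metapath_walk(papers, index, pid, metapath, random.Random(0))
    for pid in sorted(papers)
]
for walk in walks:
    print(" ".join(walk))
```

On this toy data the walk from `p1` reaches `p2` through the shared co-author and then `p3` through the shared organization, while `p4` shares nothing with the other papers and yields a walk containing only itself. Papers that co-occur in many walks end up with similar embeddings once the walks are embedded.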

Key words: author disambiguation; academic knowledge graph; heterogeneous network; meta-path random walk
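The final stage described in the abstract (fusing text and structural similarities, clustering with DBSCAN, then merging isolated papers) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the equal-weight fusion, the values of `eps`, `min_pts`, and `threshold`, the toy similarity matrices, and the rule used in `merge_isolated` (attach a noise paper to the cluster of its most similar clustered paper when the similarity reaches a threshold, otherwise keep it as its own cluster) are all hypothetical choices.

```python
def fuse(sim_text, sim_struct, weight=0.5):
    """Weighted average of two similarity matrices (the weight is an assumption)."""
    n = len(sim_text)
    return [
        [weight * sim_text[i][j] + (1 - weight) * sim_struct[i][j] for j in range(n)]
        for i in range(n)
    ]


def dbscan(sim, eps, min_pts):
    """DBSCAN on a similarity matrix, using distance = 1 - similarity.

    Returns one label per paper; -1 marks noise (an isolated paper).
    A point counts itself when its neighbourhood size is compared to min_pts.
    """
    n = len(sim)

    def region(i):
        return [j for j in range(n) if 1.0 - sim[i][j] <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region(j)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


def merge_isolated(sim, labels, threshold):
    """Attach each noise paper to its most similar clustered paper, or isolate it."""
    merged = list(labels)
    next_label = max(labels, default=-1) + 1
    clustered = [j for j, label in enumerate(labels) if label != -1]
    for i, label in enumerate(labels):
        if label != -1:
            continue
        best = max(clustered, key=lambda j: sim[i][j], default=None)
        if best is not None and sim[i][best] >= threshold:
            merged[i] = labels[best]
        else:
            merged[i] = next_label  # stays alone as a new single-paper cluster
            next_label += 1
    return merged


# Hypothetical similarities for five papers that carry the same author name.
sim_text = [
    [1.0, 0.9, 0.8, 0.6, 0.1],
    [0.9, 1.0, 0.9, 0.5, 0.1],
    [0.8, 0.9, 1.0, 0.5, 0.2],
    [0.6, 0.5, 0.5, 1.0, 0.1],
    [0.1, 0.1, 0.2, 0.1, 1.0],
]
sim_struct = [
    [1.0, 0.7, 0.8, 0.6, 0.1],
    [0.7, 1.0, 0.7, 0.5, 0.1],
    [0.8, 0.7, 1.0, 0.3, 0.0],
    [0.6, 0.5, 0.3, 1.0, 0.1],
    [0.1, 0.1, 0.0, 0.1, 1.0],
]

fused = fuse(sim_text, sim_struct)
labels = dbscan(fused, eps=0.3, min_pts=3)
final = merge_isolated(fused, labels, threshold=0.55)
print(labels)
print(final)
```

In this toy run DBSCAN groups the first three papers and marks the last two as noise. The merge step then attaches the fourth paper to that group because its best fused similarity is about 0.6, and leaves the fifth paper as a separate author. Isolated papers are not compared with each other in this sketch, which is a simplification.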
