Unsupervised author name disambiguation based on heterogeneous networks

doi:10.3969/j.issn.1000-5641.2021.06.015

Abstract

Author name disambiguation is an important step in constructing an academic knowledge graph. Ambiguous author names are widespread in academic literature because of missing data, shared names, and abbreviated names. To address the insufficient use of available information and the cold-start problem, this paper proposes an unsupervised author name disambiguation method based on heterogeneous networks that automatically learns the features of papers written by the same author. The method first preprocesses the strings of authors, organizations, titles, and keywords by lemmatization. It then learns embedded representations of text features with word2vec and TF-IDF (Term Frequency–Inverse Document Frequency), and embedded representations of structural features with meta-path random walks and word2vec. After the text and structural feature similarities are fused, disambiguation is completed with the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm and a step that merges isolated papers. Experimental results show that the proposed model outperforms existing models on a small dataset and in an engineering application for cold-start unsupervised author name disambiguation, which indicates that the model is effective and can be applied in practice.

Key words: author disambiguation, academic knowledge graph, heterogeneous network, meta-path random walk
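
The final stage described in the abstract (fuse text and structural similarity, then cluster with DBSCAN) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy vectors, the equal weighting `w_text=0.5`, and the `eps` and `min_pts` values are all assumptions, and the step that merges isolated papers is left out because the abstract does not specify it.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fused_distance(text_vecs, struct_vecs, w_text=0.5):
    """Distance = 1 - weighted mean of text and structural cosine similarity.
    The weighting scheme is a hypothetical choice for illustration."""
    n = len(text_vecs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = (w_text * cosine(text_vecs[i], text_vecs[j])
                   + (1 - w_text) * cosine(struct_vecs[i], struct_vecs[j]))
            dist[i][j] = dist[j][i] = 1.0 - sim
    return dist

def dbscan(dist, eps, min_pts):
    """Plain DBSCAN over a precomputed distance matrix; label -1 marks noise."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if dist[p][q] <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1          # provisional noise, may become a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # border point: joins the cluster, not expanded
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors)
    return labels

# Six toy papers that share one author name (made-up 2-D embeddings).
text_vecs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
struct_vecs = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.0], [0.7, -0.7]]

labels = dbscan(fused_distance(text_vecs, struct_vecs), eps=0.2, min_pts=2)
print(labels)
```

Papers labeled `-1` are the ones DBSCAN leaves unassigned; these are the isolated papers that a subsequent merging step would have to handle.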

CLC number: TP182

Chenliang GUO, Xin LIN, Yue YIN. Unsupervised author name disambiguation based on heterogeneous networks[J]. Journal of East China Normal University (Natural Science), 2021, 2021(6): 147-160.

Table 4  Experimental results on the AMiner test set

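Author name	Proposed method			AMiner method^[4]			OAG competition first-place method			Probabilistic model method^[14]			GHOST method^[11]
Author name	Recall/%	Precision/%	$F_1$/%	Recall/%	Precision/%	$F_1$/%	Recall/%	Precision/%	$F_1$/%	Recall/%	Precision/%	$F_1$/%	Recall/%	Precision/%	$F_1$/%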
XU Xu	55.85	70.83	62.46	45.86	74.18	56.68	60.39	60.69	60.54	41.87	48.16	44.80	21.79	61.34	32.15
YU Rong	40.31	85.08	54.70	46.51	89.13	61.12	53.55	71.49	61.23	40.85	65.48	50.32	36.41	92.00	52.17
TIAN Yong	62.96	80.01	70.47	51.95	76.32	61.82	63.93	39.35	48.71	56.85	70.74	63.04	54.58	86.94	67.60
HAN Lu	22.39	64.63	33.26	28.05	51.78	36.39	48.31	42.47	45.21	20.62	47.88	28.82	17.39	69.72	27.84
HUANG Lin	44.52	86.29	58.74	32.87	77.10	46.09	59.17	69.84	64.07	34.17	71.84	46.31	17.25	86.15	28.74
XU Kexin	95.12	80.71	87.32	98.64	91.37	94.87	99.65	80.46	89.04	82.47	90.02	86.08	28.52	92.90	43.64
QUAN Wei	39.78	83.81	53.95	39.02	53.88	45.26	56.55	62.09	59.19	47.66	64.45	54.77	27.80	86.42	42.07
DENG Tao	42.13	80.22	54.24	43.62	81.63	56.86	54.18	60.53	57.18	29.89	53.04	38.23	24.50	73.33	36.73
LI Hongbin	70.84	89.10	78.93	69.21	77.20	72.99	72.16	56.64	63.47	53.05	54.66	53.84	29.12	56.29	38.39
BAI Hua	33.81	83.24	48.09	39.73	71.49	51.08	35.84	63.92	45.93	35.90	58.58	44.52	29.54	83.06	43.58
CHEN Meiling	45.07	92.79	60.67	44.70	74.93	55.99	49.57	54.17	51.77	28.80	59.36	38.79	23.85	86.11	37.35
WANG Yanqing	59.32	64.81	61.95	75.33	71.52	73.37	93.58	16.07	27.43	51.97	60.40	55.87	40.39	80.79	53.86
ZHANG Xudong	7.69	61.61	13.67	22.54	62.40	33.12	10.69	43.28	17.14	23.35	70.20	35.04	7.23	85.75	13.34
SHI Qiang	47.07	53.60	50.12	36.15	52.20	42.72	51.77	47.66	49.63	36.94	43.84	40.10	26.80	53.72	35.76
ZHENG Min	11.56	45.13	18.40	22.35	57.65	32.21	16.48	26.05	20.19	19.70	54.76	28.98	15.21	80.50	25.58
Average of the 15 names	45.23	74.79	56.37	46.44	70.85	56.10	50.05	52.98	51.47	40.27	60.89	48.48	26.69	78.33	39.81
Average of all 100 names	65.77	75.21	70.17	63.03	77.96	67.79	73.36	60.14	66.10	59.53	70.63	62.81	57.09	77.22	50.23
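
As a quick consistency check on the table above, $F_1$ is the harmonic mean of recall and precision. The sketch below recomputes the two average rows for the first method's columns from the averaged recall and precision; the function name `f1` is just an illustrative helper.

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision (both in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Averaged recall and precision of the first method, taken from the table.
avg_15 = f1(45.23, 74.79)
avg_100 = f1(65.77, 75.21)

print(round(avg_15, 2))   # 56.37
print(round(avg_100, 2))  # 70.17
```

Both values match the tabulated averages, which suggests that for these columns the average $F_1$ is derived from the averaged recall and precision rather than by averaging per-name $F_1$ scores.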

References

1	DONG Y, CHAWLA N V, SWAMI A. metapath2vec: Scalable representation learning for heterogeneous networks [C]// Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017: 135-144.
2	PEROZZI B, AL-RFOU R, SKIENA S. DeepWalk: Online learning of social representations [C]// Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2014: 701-710.
3	ROBERTSON S. Understanding inverse document frequency: On theoretical arguments for IDF [J]. Journal of Documentation, 2004, 60(5): 503-520. doi: 10.1108/00220410410560582
4	ZHANG Y, ZHANG F, YAO P, et al. Name disambiguation in AMiner: Clustering, maintenance, and human in the loop [C]// Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2018: 1002-1011.
5	HAN H, GILES L, ZHA H, et al. Two supervised learning approaches for name disambiguation in author citations [C]// Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries. IEEE, 2004: 296-305.
6	VELOSO A, FERREIRA A A, GONCALVES M A, et al. Cost-effective on-demand associative author name disambiguation [J]. Information Processing and Management, 2012, 48(4): 680-697. doi: 10.1016/j.ipm.2011.08.005
7	YOSHIDA M, IKEDA M, ONO S, et al. Person name disambiguation by bootstrapping [C]// Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2010: 10-17.
8	HAN X, ZHAO J. Named entity disambiguation by leveraging Wikipedia semantic knowledge [C]// Proceedings of the 18th ACM Conference on Information and Knowledge Management. 2009: 215-224.
9	TANG J, ZHANG J, ZHANG D, et al. A unified framework for name disambiguation [C]// Proceedings of the 17th International Conference on World Wide Web. 2008: 1205-1206.
10	DENG C, DENG H, LI C. A scholar disambiguation method based on heterogeneous relation-fusion and attribute enhancement [J]. IEEE Access, 2020, 8: 28375-28384. doi: 10.1109/ACCESS.2020.2972372
11	FAN X, WANG J, PU X, et al. On graph-based name disambiguation [J]. Journal of Data and Information Quality, 2011, 2(2): 1-23.
12	MALIN B. Unsupervised name disambiguation via social network similarity [C]// Proceedings of the Workshop on Link Analysis, Counterterrorism and Security. 2005: 93-102.
13	ZHANG W, YAN Z, ZHENG Y. Author name disambiguation using graph node embedding method [C]// Proceedings of the 2019 IEEE 23rd International Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, 2019: 410-415.
14	ZHANG B, HASAN M A. Name disambiguation in anonymized graphs using network embedding [C]// Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 2017: 1239-1248.
15	KIM K, ROHATGI S, GILES C L. Hybrid deep pairwise classification for author name disambiguation [C]// Proceedings of the 2019 ACM on Conference on Information and Knowledge Management. 2019: 2369-2372.
16	PENG L, SHEN S, XU J, et al. Diting: An author disambiguation method based on network representation learning [J]. IEEE Access, 2019, 7: 135539-135555. doi: 10.1109/ACCESS.2019.2942477
17	PENG L, SHEN S, LI D, et al. Author disambiguation through adversarial network representation learning [C]// International Joint Conference on Neural Networks. 2019: paper N-19712.
18	WANG H, WANG R, WEN C, et al. Author name disambiguation on heterogeneous information network with adversarial representation learning [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2020: 238-245.
19	QIAO Z, DU Y, FU Y, et al. Unsupervised author disambiguation using heterogeneous graph convolutional network embedding [C]// Proceedings of the 2019 IEEE International Conference on Big Data. IEEE, 2019: 910-919.
20	WANG X, TANG J, CHENG H, et al. ADANA: Active name disambiguation [C]// 2011 11th IEEE International Conference on Data Mining. IEEE, 2011: 794-803.
21	NG V. Machine learning for entity coreference resolution: A retrospective look at two decades of research [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2017: 4877-4884.
22	TANG X, ZHANG J, CHEN B, et al. BERT-INT: A BERT-based interaction model for knowledge graph alignment [C]// Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020: 3174-3180.

ID	Keywords	Organization	Other authors
$P_1$	adaptive computing allocation, mobile cloud	School of Transportation and Logistics	XING Tianyi; CAI Lin; HUANG Dijiang; PENG Daiyuan; LIU Yan
$P_2$	adaptive channel allocation, wireless algorithm	National United Engineering Laboratory of Integrated and Intelligent Transportation School	ZHANG Jin; LI Wei; GULLIVER Aaron
$P_3$	network coding DVB-IPDC LTE	Secure Networking and Computing Institute, Arizona State University	WANG Lian; PENG Daiyuan
$P_4$	5-axis machining G code, Interpolation NURBS	College of Mechanical Engineering	LI Xia
$P_5$	5-axis NURBS surfaces STEP-NC	College of Mechanical and Electrical Engineering	LI Xia
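
The co-author column of the example table above is enough to illustrate a meta-path random walk of the form paper, co-author, paper. The sketch below is a simplified illustration under stated assumptions: it uses only the co-author relation (the full method may also walk over organizations, keywords, and other relations), and it records only the paper nodes visited. The helper names `build_index` and `metapath_walk` are hypothetical.

```python
import random
from collections import defaultdict

# Co-author lists copied from the example table (the ambiguous author is excluded).
coauthors = {
    "P1": ["XING Tianyi", "CAI Lin", "HUANG Dijiang", "PENG Daiyuan", "LIU Yan"],
    "P2": ["ZHANG Jin", "LI Wei", "GULLIVER Aaron"],
    "P3": ["WANG Lian", "PENG Daiyuan"],
    "P4": ["LI Xia"],
    "P5": ["LI Xia"],
}

def build_index(paper_to_nodes):
    """Invert paper -> relation nodes into relation node -> set of papers."""
    index = defaultdict(set)
    for paper, nodes in paper_to_nodes.items():
        for node in nodes:
            index[node].add(paper)
    return index

def metapath_walk(start, paper_to_nodes, node_to_papers, length, rng):
    """Walk paper -> co-author -> paper for up to `length` hops, keeping paper IDs only."""
    walk = [start]
    current = start
    for _ in range(length):
        # Only co-authors that appear on at least one other paper can carry the walk on.
        bridges = sorted(n for n in paper_to_nodes[current] if len(node_to_papers[n]) > 1)
        if not bridges:
            break  # isolated paper: no shared co-author to follow
        node = rng.choice(bridges)
        current = rng.choice(sorted(node_to_papers[node] - {current}))
        walk.append(current)
    return walk

author_index = build_index(coauthors)
rng = random.Random(0)
walks = {p: metapath_walk(p, coauthors, author_index, length=4, rng=rng) for p in coauthors}
for paper, walk in walks.items():
    print(paper, walk)
```

On this data the walks from $P_1$ and $P_3$ alternate between those two papers (shared co-author PENG Daiyuan), the walks from $P_4$ and $P_5$ alternate between each other (shared co-author LI Xia), and $P_2$ stays alone. Feeding such walks to word2vec as sentences would place papers that co-occur in walks close together in the embedding space.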

Dataset	Author names	Papers	Authors after disambiguation	Papers missing organization	Papers missing abstract	Papers missing keywords
AMiner	600	203078	39781	134 (0.06%)	3118 (1.53%)	49132 (24%)
AMiner training set	500	169720	33382	114 (0.06%)	2647 (1.56%)	41299 (24%)
AMiner test set	100	35023	6399	20 (0.06%)	488 (1.39%)	8286 (24%)
SCI	13328290	18138796	14279136	830942 (5%)	4584943 (25%)	7748879 (43%)
SCI test set	10	184	44	11 (6%)	62 (33%)	81 (44%)

Author name	Recall/%	Precision/%	$F_1$/%
ABBAS Hazzim	100.00	100.00	100.00
AALKJAER Christian	90.00	100.00	94.74
ABEL Robert	93.69	43.24	59.18
AARABI Mahmoud	100.00	100.00	100.00
AAMIR Muhammad	87.50	100.00	93.33
ABE Yuki	83.33	100.00	90.91
ABBASI Shawn	100.00	100.00	100.00
ABE Kazuo	87.76	100.00	93.48
ABDULLAH Amin	65.22	48.39	55.56
ABAB Julia	80.00	66.67	72.73
Average	88.75	85.83	87.27
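
Recall, precision, and $F_1$ for a clustering result are often computed over pairs of papers in author name disambiguation. The source does not state which definition it uses, so the pairwise definition below is an assumption, and the five-paper example is invented for illustration.

```python
from itertools import combinations

def same_cluster_pairs(labels):
    """Return the set of paper pairs that share a cluster label."""
    return {(a, b) for a, b in combinations(sorted(labels), 2) if labels[a] == labels[b]}

def pairwise_scores(predicted, truth):
    """Pairwise recall, precision, and F1 of a predicted clustering against the truth."""
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    hits = len(pred_pairs & true_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(true_pairs) if true_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1

# Ground truth: two real authors. Prediction: the first author is split in two.
truth = {"p1": "A", "p2": "A", "p3": "A", "p4": "B", "p5": "B"}
predicted = {"p1": 0, "p2": 0, "p3": 1, "p4": 2, "p5": 2}

recall, precision, f1 = pairwise_scores(predicted, truth)
print(recall, precision, round(f1, 4))
```

Under this definition, splitting one author's papers across clusters lowers recall while precision stays at 100%, which is the same pattern as several rows of the table above.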

Model variant	AMiner test set (100 authors)			AMiner training set (500 authors)			AMiner overall average
Model variant	Recall/%	Precision/%	$F_1$/%	Recall/%	Precision/%	$F_1$/%	Recall/%	Precision/%	$F_1$/%
Original model	65.77	75.21	70.17	64.56	70.21	67.27	64.76	71.04	67.75
Structural features only	60.10	65.65	62.75	57.43	63.08	60.12	57.88	63.50	60.56
Text features only	87.65	41.04	55.90	86.71	38.41	53.24	86.87	38.85	53.69
Without lemmatization	61.25	77.97	68.61	59.08	75.83	66.42	59.44	76.19	66.78
Without TF-IDF weighting	62.99	75.72	68.77	62.99	69.79	66.22	62.99	70.78	66.66
Without random shuffling of word vectors	55.92	78.76	65.40	54.49	76.46	63.63	54.73	76.84	63.92
Without keywords	63.00	75.79	68.81	62.90	69.61	66.08	62.92	70.64	66.56
Without source	62.24	77.32	68.96	59.35	74.53	66.07	59.83	75.00	66.56
Without abstract	61.18	77.24	68.28	58.75	75.19	65.97	59.16	75.53	66.35
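
The last three columns of the table above are consistent with an average of the test-set and training-set results weighted by the number of author names (100 and 500). This reading is inferred from the numbers rather than stated in the source; the helper `pooled` is an illustrative name.

```python
def pooled(test_value, train_value, n_test=100, n_train=500):
    """Mean of two scores weighted by the number of author names in each split."""
    return (n_test * test_value + n_train * train_value) / (n_test + n_train)

# First row of the table: recall and precision on the test and training sets.
pooled_recall = pooled(65.77, 64.56)
pooled_precision = pooled(75.21, 70.21)

print(round(pooled_recall, 2))     # 64.76
print(round(pooled_precision, 2))  # 71.04
```

The same weighting reproduces the recall and precision of the other rows as well, for example 57.88 for the structural-features-only recall.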

Weight $e$	Recall/%	Precision/%	$F_1$/%	Weight $e$	Recall/%	Precision/%	$F_1$/%
0.5	61.50	75.73	67.88	2.8	65.42	73.32	69.15
0.9	62.08	76.70	68.62	3.0	65.77	75.21	70.17
1.3	63.48	77.50	69.80	3.5	66.45	71.52	68.90
2.0	62.91	76.62	69.09	4.0	67.33	68.89	68.10
2.5	64.60	73.72	68.86	5.0	69.70	64.73	67.12