华东师范大学学报(自然科学版) ›› 2018, Vol. 2018 ›› Issue (5): 41-55.doi: 10.3969/j.issn.1000-5641.2018.05.004
李娜, 金冈增, 周晓旭, 郑建兵, 高明
收稿日期:
2018-07-04
出版日期:
2018-09-25
发布日期:
2018-09-26
通讯作者:
郑建兵,男,博士,研究方向为信息处理技术.E-mail:zhengjb@js.chinamobile.com.
E-mail:zhengjb@js.chinamobile.com
作者简介:
李娜,女,硕士研究生,研究方向为数据挖掘.E-mail:nali0606@foxmail.com.
基金资助:
LI Na, JIN Gang-zeng, ZHOU Xiao-xu, ZHENG Jian-bing, GAO Ming
Received:
2018-07-04
Online:
2018-09-25
Published:
2018-09-26
摘要: 互联网、物联网和云计算技术的不断融合,使得各行各业信息化程度越来越高,但同时也带来了数据碎片化的问题.数据碎片化的海量性、异构性、隐私性、相依性和低质性等特征,导致了数据可用性较差,利用这些数据难以挖掘出准确而完整的信息.为了更有效地利用数据,实体匹配、融合和消歧变得尤为重要.主要对异构网络中实体匹配算法进行了综述,对实体相似度度量和数据预处理技术进行了梳理;特别针对海量数据,概述了可扩展实体匹配方法的研究进展,综述了运用监督学习和非监督学习两类技术的实体匹配算法.
中图分类号:
李娜, 金冈增, 周晓旭, 郑建兵, 高明. 异构网络中实体匹配算法综述[J]. 华东师范大学学报(自然科学版), 2018, 2018(5): 41-55.
LI Na, JIN Gang-zeng, ZHOU Xiao-xu, ZHENG Jian-bing, GAO Ming. A survey of entity matching algorithms in heterogeneous networks[J]. Journal of East China Normal University(Natural Sc, 2018, 2018(5): 41-55.
[1] WU X, ZHU X, WU G Q, et al. Data mining with big data[J]. IEEE transactions on knowledge and data engineering, 2014, 26(1):97-107. [2] 赵国栋. 大数据时代的历史机遇:产业变革与数据科学[M]. 北京:清华大学出版社, 2013. [3] GU B, LI Z, ZHANG X, et al. The interaction between schema matching and record matching in data integration[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(1):186-199. [4] LI C, JIN L, MEHROTRA S. Supporting efficient record linkage for large data sets using mapping techniques[J]. World Wide Web, 2006, 9(4):557-584. [5] WHANG S E, GARCIA-MOLINA H. Incremental entity resolution on rules and data[J]. The VLDB Journal, 2014, 23(1):77-102. [6] GETOOR L, MACHANAVAJJHALA A. Entity resolution:theory, practice & open challenges[J]. Proceedings of the VLDB Endowment, 2012, 5(12):2018-2019. [7] TEJADA S, KNOBLOCK C A, Minton S. Learning Object Identification Rules for Information Integration[J]. Information Systems, 2001, 26(8):607-633. [8] LEITAO L, CALADO P, HERSCHEL M. Efficient and effective duplicate detection in hierarchical data[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(5):1028-1041. [9] DUNN H L. Record linkage[J]. American Journal of Public Health and the Nations Health, 1946, 36(12):1412-1416. [10] LIU S, WANG S, ZHU F, et al. Hydra:Large-scale social identity linkage via heterogeneous behavior modeling[C]//Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014:51-62. [11] LIU J, LEI K H, LIU J Y, et al. Ranking-based name matching for author disambiguation in bibliographic data[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013:Article No 8. [12] KONG C, GAO M, XU C, et al. Entity matching across multiple heterogeneous data sources[C]//International Conference on Database Systems for Advanced Applications. Cham:Springer, 2016:133-146. [13] KEJRIWAL M, MIRANKER D P. Semi-supervised instance matching using boosted classifiers[C]//European Semantic Web Conference. Cham:Springer, 2015:388-402. [14] KONDA P, DAS S, SUGANTHAN GC P, et al. Magellan:Toward building entity matching management systems[J]. Proceedings of the VLDB Endowment, 2016, 12(9):1197-1208. [15] WHANG S E, MENESTRINA D, KOUTRIKA G, et al. Entity resolution with iterative blocking[C]//Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009:219-232. [16] KARAPIPERIS D, VERYKIOS V S. An LSH-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(4):909-921. [17] GRUENHEID A, DONG X L, SRIVASTAVA D. Incremental record linkage[J]. Proceedings of the VLDB Endowment, 2014, 9(7):697-708. [18] KIM H, LEE D. HARRA:Fast iterative hashed record linkage for large-scale data collections[C]//Proceedings of the 13th International Conference on Extending Database Technology. ACM, 2010:525-536. [19] ZHANG Y, TANG J, YANG Z, et al. Cosnet:Connecting heterogeneous social networks with local and global consistency[C]//Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015:1485-1494. [20] CHRISTEN P. A survey of indexing techniques for scalable record linkage and deduplication[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(9):1537-1555. [21] WANG J, LI G, YU J X, et al. Entity matching:How similar is similar[J]. Proceedings of the VLDB Endowment, 2011, 10(4):622-633. [22] SONG Y, KIMURA T, BATJARGAL B, et al. Cross-language record linkage using word embedding driven metadata similarity measurement[C/OL]//International Semantic Web Conference (Posters & Demos). (2016-09-28)[2018-06-01]. http://ceur-ws.org/vol-1690/paper90.pdf. [23] YANG K H, PENG H T, JIANG J Y, et al. Author name disambiguation for citations using topic and web correlation[C]//International Conference on Theory and Practice of Digital Libraries. Berlin:Springer, 2008:185-196. [24] VAN Der LOO M P J. The stringdist package for approximate string matching[J]. The R Journal, 2014, 6(1):111-122. [25] LI C L, SU Y C, LIN T W, et al. Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013:Article No 2. [26] PELED O, FIRE M, ROKACH L, et al. Entity matching in online social networks[C]//Social Computing (SocialCom), 2013 International Conference on. IEEE, 2013:339-344. [27] LI J, LIANG X, DING W, et al. Feature engineering and tree modeling for author-paper identification challenge[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013:Article No 5. [28] PELED O, FIRE M, ROKACH L, et al. Matching entities across online social networks[J]. Neurocomputing, 2016, 210(C):91-106. [29] TANG J, FONG A C M, WANG B, et al. A unified probabilistic framework for name disambiguation in digital library[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(6):975-987. [30] KOUDAS N, SARAWAGI S, SRIVASTAVA D. Record linkage:similarity measures and algorithms[C]//Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. ACM, 2006:802-803. [31] VESDAPUNT N, GARCIA-MOLINA H. Identifying users in social networks with limited information[C]//Data Engineering (ICDE), 2015 IEEE 31st International Conference on. IEEE, 2015:627-638. [32] LEE J Y, HUSSAIN R, RIVERA V, et al. Second-level degree-based entity resolution in online social networks[J/OL]. Social Network Analysis and Mining, 2018, 8:19. (2018-03-16)[2018-06-01]. https://link.springer.com/content/pdf/10.1007%2Fs13278-018-0499-9.pdf. [33] ZHOU X, LIANG X, ZHANG H, et al. Cross-platform identification of anonymous identical users in multiple social media networks[J]. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(2):411-424. [34] ZHANG J, YU P S, ZHOU Z H. Meta-path based multi-network collective link prediction[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014:1286-1295. [35] LEVIN F H, HEUSER C A. Evaluating the use of social networks in author name disambiguation in digital libraries[J]. Journal of Information and Data Management, 2010(2):183-197. [36] POOJA K M, MONDAL S, CHANDRA J. An unsupervised heuristic based approach for author name disambiguation[C]//Communication Systems and Networks (COMSNETS), 201810th International Conference on. IEEE, 2018:540-542. [37] ZHONG E, LI L, WANG N, et al. Contextual rule-based feature engineering for author-paper identification[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013:Article No 6. [38] PEROZZI B, AL-RFOU r, SKIENA S. Deepwalk:Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014:701-710. [39] GROVER A, LESKOVEC J. node2vec:Scalable feature learning for networks[C]//Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016:855-864. [40] TANG J, QU M, WANG M, et al. Line:Large-scale information network embedding[C]//Proceedings of the 24th International Conference on World Wide Web, International World Wide Web, Conferences Steering Committee, 2015:1067-1077. [41] ZHOU X, LIANG X, DU X, et al. Structure based user identification across social networks[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(6):1178-1191. [42] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint, arXiv:1301.3781v3[cs.CV] 7 Sep 2013. [43] LIU S, WANG S, ZHU F. Structured learning from heterogeneous behavior for social identity linkage[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(7):2005-2019. [44] DEY D. Entity matching in heterogeneous databases:A logistic regression approach[J]. Decision Support Systems, 2008, 44(3):740-747. [45] ZAFARANI R, LIU H. Connecting users across social media sites:a behavioral-modeling approach[C]//Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013:41-49. [46] ZHOU Y, HOWROYD J, DANICIC S, et al. Extending naive bayes classifier with hierarchy feature level information for record linkage[C]//Workshop on Advanced Methodologies for Bayesian Networks. Cham:Springer, 2015:93-104. [47] LI L, LI J, GAO H. Rule-based method for entity resolution[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(1):250-263. [48] CHIN W S, ZHUANG Y, JUAN Y C, et al. Effective string processing and matching for author disambiguation[J]. The Journal of Machine Learning Research, 2014, 15(1):3037-3064. [49] JIANG Y, LIN C, MENG W, et al. Rule-based deduplication of article records from bibliographic databases[J]. Database, 2014:Article ID bat086. DOI:10.1093/database/bat086. [50] SETOGUCHI S, ZHU Y, JALBERT J J, et al. Validity of deterministic record linkage using multiple indirect personal identifiers:linking a large registry to claims data[J]. Circulation:Cardiovascular Quality and Outcomes, 2014, 7(3):475-480. [51] EKTEFA M, JABAR M A, Sidi F, et al. A threshold-based similarity measure for duplicate detection[C]//Open Systems (ICOS), 2011 IEEE Conference on. IEEE, 2011:37-41. [52] MACQUEEN J. Some methods for classification and analysis of multivariate observations[C]//Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967:281-297. [53] MONIKA S, ANAND C, GNANAMURTHY R K. Analyzing the User Profile Linkage across Different Social Network Platforms[J]. International Journal of Computer Science and Engineering, 2016, 4(2):1378-1383. [54] CEN L, DRAGUT E C, SI L, et al. Author disambiguation by hierarchical agglomerative clustering with adaptive stopping criterion[C]//Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013:741-744. [55] SONG Y, HUANG J, COUNCILL I G, et al. Efficient topic-based unsupervised name disambiguation[C]//Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, 2007:342-351. [56] QIAN Y, HU Y, CUI J, et al. Combining machine learning and human judgment in author disambiguation[C]//Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2011:1241-1246. [57] FELLEGI I P, SUNTER A B. A theory for record linkage[J]. Journal of the American Statistical Association, 1969, 64(328):1183-1210. [58] WINKLER W E. Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage[C]//Proceedings of the Section on Survey Research Methods, American Statistical Association. 1988:667-671. [59] BOYD S, VANDENBERGHE L. Convex Optimization[M]. Cambridge:Cambridge University Press, 2004. [60] GAO M, LIM E P, LO D, et al. CNL:Collective network linkage across heterogeneous social platforms[C]//IEEE International Conference on Data Mining. IEEE Computer Society, 2015:757-762. [61] YAKOUT M, ELMAGARMID A K, Elmeleegy H, et al. Behavior based record linkage[J]. Proceedings of the VLDB Endowment, 2010, 3(1/2):439-448. [62] FANG Y, CHANG M W. Entity linking on microblogs with spatial and temporal signals[J]. Transactions of the Association for Computational Linguistics, 2014, 2(1):259-272. [63] HAN H, XU W, ZHA H, et al. A hierarchical naive Bayes mixture model for name disambiguation in author citations[C]//Proceedings of the 2005 ACM Symposium on Applied Computing. ACM, 2005:1065-1069. [64] SINGLA P, DOMINGOS P. Collective object identification[C]//International Joint Conference on Artificial Intelligence.[S.l.]. Morgan Kaufmann Publishers Inc, 2005:1636-1637. [65] BALAJI J, MIN C, JAVED F, et al. Avatar:Large scale entity resolution of heterogeneous user profiles[C]//Proceedings of the 2nd Workshop on Data Management for End-To-End Machine Learning. ACM, 2018:Article No 2. [66] BLEI D M, NG A Y, JORDAN M I. Latent dirichlet allocation[J]. Journal of machine Learning Research, 2003(3):993-1022. [67] ZHAO W X, JIANG J, WENG J, et al. Comparing twitter and traditional media using topic models[C]//European Conference on Information Retrieval. Berlin:Springer, 2011:338-349. [68] YANG Y, SUN Y, TANG J, et al. Entity matching across heterogeneous sources[C]//Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015:1395-1404. [69] BHATTACHARYA I, GETOOR L. A latent dirichlet model for unsupervised entity resolution[C]//Proceedings of the 2006 SIAM International Conference on Data Mining,[S.l.]:Society for Industrial and Applied Mathematics, 2006:47-58. [70] SEN P. Collective context-aware topic models for entity disambiguation[C]//Proceedings of the 21st International Conference on World Wide Web. ACM, 2012:729-738. [71] TANG J, FANG Z, SUN J. Incorporating social context and domain knowledge for entity recognition[C]//Proceedings of the 24th International Conference on World Wide Web.[S.l.]:International World Wide Web Conferences Steering Committee, 2015:517-526. [72] STEORTS R C, HALL R, FIENBERG S E. A bayesian approach to graphical record linkage and deduplication[J]. Journal of the American Statistical Association, 2016, 516(111):1660-1672. [73] MAURYA A, TELANG R. Bayesian multi-view models for member-job matching and personalized skill recommendations[C]//Big Data, 2017 IEEE International Conference on. IEEE, 2017:1193-1202. |
[1] | 王婕, 黄温瑞, 赵生宇, 夏小雅, 韩凡宇, 王伟, 张琰彬. OpenRank贡献度评估方法及其在开源课程中的实证研究[J]. 华东师范大学学报(自然科学版), 2024, 2024(5): 11-19. |
[2] | 张嘉睿, 张豈明, 毕枫林, 张琰彬, 王伟, 任而今, 张海立. 基于IPEX-LLM的本地轻量化课程教学智能辅助系统[J]. 华东师范大学学报(自然科学版), 2024, 2024(5): 162-172. |
[3] | 冯艳丽, 周宇, 黄福兴, 万俊岭, 袁培森. 基于SF-Transformer的智能教育平台短期电力负荷预测研究[J]. 华东师范大学学报(自然科学版), 2024, 2024(5): 173-182. |
[4] | 任俊霖, 王欢, 黄骁迪, 李艳婷, 琚生根. 基于序列感知与多元行为数据的MOOCs知识概念推荐[J]. 华东师范大学学报(自然科学版), 2024, 2024(5): 45-56. |
[5] | 陈郅睿, 陆雪松. 基于开源代码大语言模型提示的学生代码修复[J]. 华东师范大学学报(自然科学版), 2024, 2024(5): 93-103. |
[6] | 谢琳玎, 张远, 蔡亦红. 基于生物信息学构建肝癌免疫预后基因模型及初步验证[J]. 华东师范大学学报(自然科学版), 2024, 2024(4): 100-110. |
[7] | 饶思敬, 辛颖, 潘俊君. 角色动画中基于隐式曲面的皮肤变形方法[J]. 华东师范大学学报(自然科学版), 2024, 2024(2): 143-156. |
[8] | 冯路平, 施力业, 吴雯, 郑骏, 胡文心, 郑巍. 基于偏好级别陌生人信息辅助的推荐系统[J]. 华东师范大学学报(自然科学版), 2024, 2024(2): 53-64. |
[9] | 张永卓, 诸葛晴凤, 沙行勉, 宋玉红. 基于并行块的自适应量化随机计算[J]. 华东师范大学学报(自然科学版), 2024, 2024(2): 76-85. |
[10] | 卢欣, 黄昶, 金志伟. 基于转台的多视角多姿态锁销点云模型重建[J]. 华东师范大学学报(自然科学版), 2024, 2024(2): 86-96. |
[11] | 唐凌霄, 黄昶. 基于双层局部能量因子的红外小目标检测方法[J]. 华东师范大学学报(自然科学版), 2024, 2024(2): 97-107. |
[12] | 王世瑞, 沈芳, 李仁虎, 李鹏. 无人机高光谱影像水面耀光去除及信息重构方法研究[J]. 华东师范大学学报(自然科学版), 2024, 2024(1): 36-49. |
[13] | 肖开研, 廉洁. 基于多核支持向量机的句子分类算法[J]. 华东师范大学学报(自然科学版), 2023, 2023(6): 85-94. |
[14] | 崔瑞博, 王峰. 动量更新与重构约束的限制视角下3D物品识别[J]. 华东师范大学学报(自然科学版), 2023, 2023(6): 61-72. |
[15] | 陈道佳, 陈志云. 分层级描述感知的个性化推荐系统[J]. 华东师范大学学报(自然科学版), 2023, 2023(6): 73-84. |
阅读次数 | ||||||
全文 |
|
|||||
摘要 |
|
|||||