综述论文

异构网络中实体匹配算法综述

  • 李娜 ,
  • 金冈增 ,
  • 周晓旭 ,
  • 郑建兵 ,
  • 高明
展开
  • 华东师范大学 数据科学与工程学院, 上海 200062
李娜,女,硕士研究生,研究方向为数据挖掘.E-mail:nali0606@foxmail.com.

收稿日期: 2018-07-04

  网络出版日期: 2018-09-26

基金资助

国家重点研发计划项目(2016YFB1000905);国家自然科学基金广东省联合重点项目(U1401256);国家自然科学基金(61672234,61502236,61472321);上海市科技兴农推广项目(T20170303)

A survey of entity matching algorithms in heterogeneous networks

  • LI Na ,
  • JIN Gang-zeng ,
  • ZHOU Xiao-xu ,
  • ZHENG Jian-bing ,
  • GAO Ming
Expand
  • School of Data Science and Engineering, East China Normal University, Shanghai 200062, China

Received date: 2018-07-04

  Online published: 2018-09-26

摘要

互联网、物联网和云计算技术的不断融合,使得各行各业信息化程度越来越高,但同时也带来了数据碎片化的问题.数据碎片化的海量性、异构性、隐私性、相依性和低质性等特征,导致了数据可用性较差,利用这些数据难以挖掘出准确而完整的信息.为了更有效地利用数据,实体匹配、融合和消歧变得尤为重要.主要对异构网络中实体匹配算法进行了综述,对实体相似度度量和数据预处理技术进行了梳理;特别针对海量数据,概述了可扩展实体匹配方法的研究进展,综述了运用监督学习和非监督学习两类技术的实体匹配算法.

本文引用格式

李娜 , 金冈增 , 周晓旭 , 郑建兵 , 高明 . 异构网络中实体匹配算法综述[J]. 华东师范大学学报(自然科学版), 2018 , 2018(5) : 41 -55 . DOI: 10.3969/j.issn.1000-5641.2018.05.004

Abstract

The continuous integration of Internet, Internet of Things, and cloud computing technologies has been improving digitization across different industries, but it has also introduced increased data fragmentation. Data fragmentation is characterized by mass, heterogeneity, privacy, dependence, and low quality, resulting in poor data availability. As a result, it is often difficult to obtain accurate and complete information for many analytical tasks. To make effective use of data, entity matching, fusion, and disambiguation are of particular significance. In this paper, we summarize data preprocessing, similarity measurements, and entity matching algorithms of heterogeneous networks. In addition, particularly for large datasets, we investigate scalable entity matching algorithms. Existing entity matching algorithms can be categorized into two groups:supervised and unsupervised learning-based algorithms. We conclude the study with research progress on entity matching and topics for future research.

参考文献

[1] WU X, ZHU X, WU G Q, et al. Data mining with big data[J]. IEEE transactions on knowledge and data engineering, 2014, 26(1):97-107.
[2] 赵国栋. 大数据时代的历史机遇:产业变革与数据科学[M]. 北京:清华大学出版社, 2013.
[3] GU B, LI Z, ZHANG X, et al. The interaction between schema matching and record matching in data integration[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(1):186-199.
[4] LI C, JIN L, MEHROTRA S. Supporting efficient record linkage for large data sets using mapping techniques[J]. World Wide Web, 2006, 9(4):557-584.
[5] WHANG S E, GARCIA-MOLINA H. Incremental entity resolution on rules and data[J]. The VLDB Journal, 2014, 23(1):77-102.
[6] GETOOR L, MACHANAVAJJHALA A. Entity resolution:theory, practice & open challenges[J]. Proceedings of the VLDB Endowment, 2012, 5(12):2018-2019.
[7] TEJADA S, KNOBLOCK C A, Minton S. Learning Object Identification Rules for Information Integration[J]. Information Systems, 2001, 26(8):607-633.
[8] LEITAO L, CALADO P, HERSCHEL M. Efficient and effective duplicate detection in hierarchical data[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(5):1028-1041.
[9] DUNN H L. Record linkage[J]. American Journal of Public Health and the Nations Health, 1946, 36(12):1412-1416.
[10] LIU S, WANG S, ZHU F, et al. Hydra:Large-scale social identity linkage via heterogeneous behavior modeling[C]//Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014:51-62.
[11] LIU J, LEI K H, LIU J Y, et al. Ranking-based name matching for author disambiguation in bibliographic data[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013:Article No 8.
[12] KONG C, GAO M, XU C, et al. Entity matching across multiple heterogeneous data sources[C]//International Conference on Database Systems for Advanced Applications. Cham:Springer, 2016:133-146.
[13] KEJRIWAL M, MIRANKER D P. Semi-supervised instance matching using boosted classifiers[C]//European Semantic Web Conference. Cham:Springer, 2015:388-402.
[14] KONDA P, DAS S, SUGANTHAN GC P, et al. Magellan:Toward building entity matching management systems[J]. Proceedings of the VLDB Endowment, 2016, 12(9):1197-1208.
[15] WHANG S E, MENESTRINA D, KOUTRIKA G, et al. Entity resolution with iterative blocking[C]//Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009:219-232.
[16] KARAPIPERIS D, VERYKIOS V S. An LSH-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(4):909-921.
[17] GRUENHEID A, DONG X L, SRIVASTAVA D. Incremental record linkage[J]. Proceedings of the VLDB Endowment, 2014, 9(7):697-708.
[18] KIM H, LEE D. HARRA:Fast iterative hashed record linkage for large-scale data collections[C]//Proceedings of the 13th International Conference on Extending Database Technology. ACM, 2010:525-536.
[19] ZHANG Y, TANG J, YANG Z, et al. Cosnet:Connecting heterogeneous social networks with local and global consistency[C]//Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015:1485-1494.
[20] CHRISTEN P. A survey of indexing techniques for scalable record linkage and deduplication[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(9):1537-1555.
[21] WANG J, LI G, YU J X, et al. Entity matching:How similar is similar[J]. Proceedings of the VLDB Endowment, 2011, 10(4):622-633.
[22] SONG Y, KIMURA T, BATJARGAL B, et al. Cross-language record linkage using word embedding driven metadata similarity measurement[C/OL]//International Semantic Web Conference (Posters & Demos). (2016-09-28)[2018-06-01]. http://ceur-ws.org/vol-1690/paper90.pdf.
[23] YANG K H, PENG H T, JIANG J Y, et al. Author name disambiguation for citations using topic and web correlation[C]//International Conference on Theory and Practice of Digital Libraries. Berlin:Springer, 2008:185-196.
[24] VAN Der LOO M P J. The stringdist package for approximate string matching[J]. The R Journal, 2014, 6(1):111-122.
[25] LI C L, SU Y C, LIN T W, et al. Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013:Article No 2.
[26] PELED O, FIRE M, ROKACH L, et al. Entity matching in online social networks[C]//Social Computing (SocialCom), 2013 International Conference on. IEEE, 2013:339-344.
[27] LI J, LIANG X, DING W, et al. Feature engineering and tree modeling for author-paper identification challenge[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013:Article No 5.
[28] PELED O, FIRE M, ROKACH L, et al. Matching entities across online social networks[J]. Neurocomputing, 2016, 210(C):91-106.
[29] TANG J, FONG A C M, WANG B, et al. A unified probabilistic framework for name disambiguation in digital library[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(6):975-987.
[30] KOUDAS N, SARAWAGI S, SRIVASTAVA D. Record linkage:similarity measures and algorithms[C]//Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. ACM, 2006:802-803.
[31] VESDAPUNT N, GARCIA-MOLINA H. Identifying users in social networks with limited information[C]//Data Engineering (ICDE), 2015 IEEE 31st International Conference on. IEEE, 2015:627-638.
[32] LEE J Y, HUSSAIN R, RIVERA V, et al. Second-level degree-based entity resolution in online social networks[J/OL]. Social Network Analysis and Mining, 2018, 8:19. (2018-03-16)[2018-06-01]. https://link.springer.com/content/pdf/10.1007%2Fs13278-018-0499-9.pdf.
[33] ZHOU X, LIANG X, ZHANG H, et al. Cross-platform identification of anonymous identical users in multiple social media networks[J]. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(2):411-424.
[34] ZHANG J, YU P S, ZHOU Z H. Meta-path based multi-network collective link prediction[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014:1286-1295.
[35] LEVIN F H, HEUSER C A. Evaluating the use of social networks in author name disambiguation in digital libraries[J]. Journal of Information and Data Management, 2010(2):183-197.
[36] POOJA K M, MONDAL S, CHANDRA J. An unsupervised heuristic based approach for author name disambiguation[C]//Communication Systems and Networks (COMSNETS), 201810th International Conference on. IEEE, 2018:540-542.
[37] ZHONG E, LI L, WANG N, et al. Contextual rule-based feature engineering for author-paper identification[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013:Article No 6.
[38] PEROZZI B, AL-RFOU r, SKIENA S. Deepwalk:Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014:701-710.
[39] GROVER A, LESKOVEC J. node2vec:Scalable feature learning for networks[C]//Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016:855-864.
[40] TANG J, QU M, WANG M, et al. Line:Large-scale information network embedding[C]//Proceedings of the 24th International Conference on World Wide Web, International World Wide Web, Conferences Steering Committee, 2015:1067-1077.
[41] ZHOU X, LIANG X, DU X, et al. Structure based user identification across social networks[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(6):1178-1191.
[42] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint, arXiv:1301.3781v3[cs.CV] 7 Sep 2013.
[43] LIU S, WANG S, ZHU F. Structured learning from heterogeneous behavior for social identity linkage[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(7):2005-2019.
[44] DEY D. Entity matching in heterogeneous databases:A logistic regression approach[J]. Decision Support Systems, 2008, 44(3):740-747.
[45] ZAFARANI R, LIU H. Connecting users across social media sites:a behavioral-modeling approach[C]//Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013:41-49.
[46] ZHOU Y, HOWROYD J, DANICIC S, et al. Extending naive bayes classifier with hierarchy feature level information for record linkage[C]//Workshop on Advanced Methodologies for Bayesian Networks. Cham:Springer, 2015:93-104.
[47] LI L, LI J, GAO H. Rule-based method for entity resolution[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(1):250-263.
[48] CHIN W S, ZHUANG Y, JUAN Y C, et al. Effective string processing and matching for author disambiguation[J]. The Journal of Machine Learning Research, 2014, 15(1):3037-3064.
[49] JIANG Y, LIN C, MENG W, et al. Rule-based deduplication of article records from bibliographic databases[J]. Database, 2014:Article ID bat086. DOI:10.1093/database/bat086.
[50] SETOGUCHI S, ZHU Y, JALBERT J J, et al. Validity of deterministic record linkage using multiple indirect personal identifiers:linking a large registry to claims data[J]. Circulation:Cardiovascular Quality and Outcomes, 2014, 7(3):475-480.
[51] EKTEFA M, JABAR M A, Sidi F, et al. A threshold-based similarity measure for duplicate detection[C]//Open Systems (ICOS), 2011 IEEE Conference on. IEEE, 2011:37-41.
[52] MACQUEEN J. Some methods for classification and analysis of multivariate observations[C]//Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967:281-297.
[53] MONIKA S, ANAND C, GNANAMURTHY R K. Analyzing the User Profile Linkage across Different Social Network Platforms[J]. International Journal of Computer Science and Engineering, 2016, 4(2):1378-1383.
[54] CEN L, DRAGUT E C, SI L, et al. Author disambiguation by hierarchical agglomerative clustering with adaptive stopping criterion[C]//Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013:741-744.
[55] SONG Y, HUANG J, COUNCILL I G, et al. Efficient topic-based unsupervised name disambiguation[C]//Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, 2007:342-351.
[56] QIAN Y, HU Y, CUI J, et al. Combining machine learning and human judgment in author disambiguation[C]//Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2011:1241-1246.
[57] FELLEGI I P, SUNTER A B. A theory for record linkage[J]. Journal of the American Statistical Association, 1969, 64(328):1183-1210.
[58] WINKLER W E. Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage[C]//Proceedings of the Section on Survey Research Methods, American Statistical Association. 1988:667-671.
[59] BOYD S, VANDENBERGHE L. Convex Optimization[M]. Cambridge:Cambridge University Press, 2004.
[60] GAO M, LIM E P, LO D, et al. CNL:Collective network linkage across heterogeneous social platforms[C]//IEEE International Conference on Data Mining. IEEE Computer Society, 2015:757-762.
[61] YAKOUT M, ELMAGARMID A K, Elmeleegy H, et al. Behavior based record linkage[J]. Proceedings of the VLDB Endowment, 2010, 3(1/2):439-448.
[62] FANG Y, CHANG M W. Entity linking on microblogs with spatial and temporal signals[J]. Transactions of the Association for Computational Linguistics, 2014, 2(1):259-272.
[63] HAN H, XU W, ZHA H, et al. A hierarchical naive Bayes mixture model for name disambiguation in author citations[C]//Proceedings of the 2005 ACM Symposium on Applied Computing. ACM, 2005:1065-1069.
[64] SINGLA P, DOMINGOS P. Collective object identification[C]//International Joint Conference on Artificial Intelligence.[S.l.]. Morgan Kaufmann Publishers Inc, 2005:1636-1637.
[65] BALAJI J, MIN C, JAVED F, et al. Avatar:Large scale entity resolution of heterogeneous user profiles[C]//Proceedings of the 2nd Workshop on Data Management for End-To-End Machine Learning. ACM, 2018:Article No 2.
[66] BLEI D M, NG A Y, JORDAN M I. Latent dirichlet allocation[J]. Journal of machine Learning Research, 2003(3):993-1022.
[67] ZHAO W X, JIANG J, WENG J, et al. Comparing twitter and traditional media using topic models[C]//European Conference on Information Retrieval. Berlin:Springer, 2011:338-349.
[68] YANG Y, SUN Y, TANG J, et al. Entity matching across heterogeneous sources[C]//Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015:1395-1404.
[69] BHATTACHARYA I, GETOOR L. A latent dirichlet model for unsupervised entity resolution[C]//Proceedings of the 2006 SIAM International Conference on Data Mining,[S.l.]:Society for Industrial and Applied Mathematics, 2006:47-58.
[70] SEN P. Collective context-aware topic models for entity disambiguation[C]//Proceedings of the 21st International Conference on World Wide Web. ACM, 2012:729-738.
[71] TANG J, FANG Z, SUN J. Incorporating social context and domain knowledge for entity recognition[C]//Proceedings of the 24th International Conference on World Wide Web.[S.l.]:International World Wide Web Conferences Steering Committee, 2015:517-526.
[72] STEORTS R C, HALL R, FIENBERG S E. A bayesian approach to graphical record linkage and deduplication[J]. Journal of the American Statistical Association, 2016, 516(111):1660-1672.
[73] MAURYA A, TELANG R. Bayesian multi-view models for member-job matching and personalized skill recommendations[C]//Big Data, 2017 IEEE International Conference on. IEEE, 2017:1193-1202.
文章导航

/