华东师范大学学报(自然科学版) ›› 2018, Vol. 2018 ›› Issue (5): 41-55.doi: 10.3969/j.issn.1000-5641.2018.05.004
李娜, 金冈增, 周晓旭, 郑建兵, 高明
收稿日期:
2018-07-04
出版日期:
2018-09-25
发布日期:
2018-09-26
通讯作者:
郑建兵,男,博士,研究方向为信息处理技术.E-mail:zhengjb@js.chinamobile.com.
E-mail:zhengjb@js.chinamobile.com
作者简介:
李娜,女,硕士研究生,研究方向为数据挖掘.E-mail:nali0606@foxmail.com.
基金资助:
LI Na, JIN Gang-zeng, ZHOU Xiao-xu, ZHENG Jian-bing, GAO Ming
Received:
2018-07-04
Online:
2018-09-25
Published:
2018-09-26
摘要: 互联网、物联网和云计算技术的不断融合,使得各行各业信息化程度越来越高,但同时也带来了数据碎片化的问题.数据碎片化的海量性、异构性、隐私性、相依性和低质性等特征,导致了数据可用性较差,利用这些数据难以挖掘出准确而完整的信息.为了更有效地利用数据,实体匹配、融合和消歧变得尤为重要.主要对异构网络中实体匹配算法进行了综述,对实体相似度度量和数据预处理技术进行了梳理;特别针对海量数据,概述了可扩展实体匹配方法的研究进展,综述了运用监督学习和非监督学习两类技术的实体匹配算法.
中图分类号:
李娜, 金冈增, 周晓旭, 郑建兵, 高明. 异构网络中实体匹配算法综述[J]. 华东师范大学学报(自然科学版), 2018, 2018(5): 41-55.
LI Na, JIN Gang-zeng, ZHOU Xiao-xu, ZHENG Jian-bing, GAO Ming. A survey of entity matching algorithms in heterogeneous networks[J]. Journal of East China Normal University(Natural Sc, 2018, 2018(5): 41-55.
[1] WU X, ZHU X, WU G Q, et al. Data mining with big data[J]. IEEE transactions on knowledge and data engineering, 2014, 26(1):97-107. [2] 赵国栋. 大数据时代的历史机遇:产业变革与数据科学[M]. 北京:清华大学出版社, 2013. [3] GU B, LI Z, ZHANG X, et al. The interaction between schema matching and record matching in data integration[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(1):186-199. [4] LI C, JIN L, MEHROTRA S. Supporting efficient record linkage for large data sets using mapping techniques[J]. World Wide Web, 2006, 9(4):557-584. [5] WHANG S E, GARCIA-MOLINA H. Incremental entity resolution on rules and data[J]. The VLDB Journal, 2014, 23(1):77-102. [6] GETOOR L, MACHANAVAJJHALA A. Entity resolution:theory, practice & open challenges[J]. Proceedings of the VLDB Endowment, 2012, 5(12):2018-2019. [7] TEJADA S, KNOBLOCK C A, Minton S. Learning Object Identification Rules for Information Integration[J]. Information Systems, 2001, 26(8):607-633. [8] LEITAO L, CALADO P, HERSCHEL M. Efficient and effective duplicate detection in hierarchical data[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(5):1028-1041. [9] DUNN H L. Record linkage[J]. American Journal of Public Health and the Nations Health, 1946, 36(12):1412-1416. [10] LIU S, WANG S, ZHU F, et al. Hydra:Large-scale social identity linkage via heterogeneous behavior modeling[C]//Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014:51-62. [11] LIU J, LEI K H, LIU J Y, et al. Ranking-based name matching for author disambiguation in bibliographic data[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013:Article No 8. [12] KONG C, GAO M, XU C, et al. Entity matching across multiple heterogeneous data sources[C]//International Conference on Database Systems for Advanced Applications. Cham:Springer, 2016:133-146. [13] KEJRIWAL M, MIRANKER D P. Semi-supervised instance matching using boosted classifiers[C]//European Semantic Web Conference. Cham:Springer, 2015:388-402. [14] KONDA P, DAS S, SUGANTHAN GC P, et al. Magellan:Toward building entity matching management systems[J]. Proceedings of the VLDB Endowment, 2016, 12(9):1197-1208. [15] WHANG S E, MENESTRINA D, KOUTRIKA G, et al. Entity resolution with iterative blocking[C]//Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009:219-232. [16] KARAPIPERIS D, VERYKIOS V S. An LSH-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(4):909-921. [17] GRUENHEID A, DONG X L, SRIVASTAVA D. Incremental record linkage[J]. Proceedings of the VLDB Endowment, 2014, 9(7):697-708. [18] KIM H, LEE D. HARRA:Fast iterative hashed record linkage for large-scale data collections[C]//Proceedings of the 13th International Conference on Extending Database Technology. ACM, 2010:525-536. [19] ZHANG Y, TANG J, YANG Z, et al. Cosnet:Connecting heterogeneous social networks with local and global consistency[C]//Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015:1485-1494. [20] CHRISTEN P. A survey of indexing techniques for scalable record linkage and deduplication[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(9):1537-1555. [21] WANG J, LI G, YU J X, et al. Entity matching:How similar is similar[J]. Proceedings of the VLDB Endowment, 2011, 10(4):622-633. [22] SONG Y, KIMURA T, BATJARGAL B, et al. Cross-language record linkage using word embedding driven metadata similarity measurement[C/OL]//International Semantic Web Conference (Posters & Demos). (2016-09-28)[2018-06-01]. http://ceur-ws.org/vol-1690/paper90.pdf. [23] YANG K H, PENG H T, JIANG J Y, et al. Author name disambiguation for citations using topic and web correlation[C]//International Conference on Theory and Practice of Digital Libraries. Berlin:Springer, 2008:185-196. [24] VAN Der LOO M P J. The stringdist package for approximate string matching[J]. The R Journal, 2014, 6(1):111-122. [25] LI C L, SU Y C, LIN T W, et al. Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013:Article No 2. [26] PELED O, FIRE M, ROKACH L, et al. Entity matching in online social networks[C]//Social Computing (SocialCom), 2013 International Conference on. IEEE, 2013:339-344. [27] LI J, LIANG X, DING W, et al. Feature engineering and tree modeling for author-paper identification challenge[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013:Article No 5. [28] PELED O, FIRE M, ROKACH L, et al. Matching entities across online social networks[J]. Neurocomputing, 2016, 210(C):91-106. [29] TANG J, FONG A C M, WANG B, et al. A unified probabilistic framework for name disambiguation in digital library[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(6):975-987. [30] KOUDAS N, SARAWAGI S, SRIVASTAVA D. Record linkage:similarity measures and algorithms[C]//Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. ACM, 2006:802-803. [31] VESDAPUNT N, GARCIA-MOLINA H. Identifying users in social networks with limited information[C]//Data Engineering (ICDE), 2015 IEEE 31st International Conference on. IEEE, 2015:627-638. [32] LEE J Y, HUSSAIN R, RIVERA V, et al. Second-level degree-based entity resolution in online social networks[J/OL]. Social Network Analysis and Mining, 2018, 8:19. (2018-03-16)[2018-06-01]. https://link.springer.com/content/pdf/10.1007%2Fs13278-018-0499-9.pdf. [33] ZHOU X, LIANG X, ZHANG H, et al. Cross-platform identification of anonymous identical users in multiple social media networks[J]. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(2):411-424. [34] ZHANG J, YU P S, ZHOU Z H. Meta-path based multi-network collective link prediction[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014:1286-1295. [35] LEVIN F H, HEUSER C A. Evaluating the use of social networks in author name disambiguation in digital libraries[J]. Journal of Information and Data Management, 2010(2):183-197. [36] POOJA K M, MONDAL S, CHANDRA J. An unsupervised heuristic based approach for author name disambiguation[C]//Communication Systems and Networks (COMSNETS), 201810th International Conference on. IEEE, 2018:540-542. [37] ZHONG E, LI L, WANG N, et al. Contextual rule-based feature engineering for author-paper identification[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013:Article No 6. [38] PEROZZI B, AL-RFOU r, SKIENA S. Deepwalk:Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014:701-710. [39] GROVER A, LESKOVEC J. node2vec:Scalable feature learning for networks[C]//Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016:855-864. [40] TANG J, QU M, WANG M, et al. Line:Large-scale information network embedding[C]//Proceedings of the 24th International Conference on World Wide Web, International World Wide Web, Conferences Steering Committee, 2015:1067-1077. [41] ZHOU X, LIANG X, DU X, et al. Structure based user identification across social networks[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(6):1178-1191. [42] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint, arXiv:1301.3781v3[cs.CV] 7 Sep 2013. [43] LIU S, WANG S, ZHU F. Structured learning from heterogeneous behavior for social identity linkage[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(7):2005-2019. [44] DEY D. Entity matching in heterogeneous databases:A logistic regression approach[J]. Decision Support Systems, 2008, 44(3):740-747. [45] ZAFARANI R, LIU H. Connecting users across social media sites:a behavioral-modeling approach[C]//Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013:41-49. [46] ZHOU Y, HOWROYD J, DANICIC S, et al. Extending naive bayes classifier with hierarchy feature level information for record linkage[C]//Workshop on Advanced Methodologies for Bayesian Networks. Cham:Springer, 2015:93-104. [47] LI L, LI J, GAO H. Rule-based method for entity resolution[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(1):250-263. [48] CHIN W S, ZHUANG Y, JUAN Y C, et al. Effective string processing and matching for author disambiguation[J]. The Journal of Machine Learning Research, 2014, 15(1):3037-3064. [49] JIANG Y, LIN C, MENG W, et al. Rule-based deduplication of article records from bibliographic databases[J]. Database, 2014:Article ID bat086. DOI:10.1093/database/bat086. [50] SETOGUCHI S, ZHU Y, JALBERT J J, et al. Validity of deterministic record linkage using multiple indirect personal identifiers:linking a large registry to claims data[J]. Circulation:Cardiovascular Quality and Outcomes, 2014, 7(3):475-480. [51] EKTEFA M, JABAR M A, Sidi F, et al. A threshold-based similarity measure for duplicate detection[C]//Open Systems (ICOS), 2011 IEEE Conference on. IEEE, 2011:37-41. [52] MACQUEEN J. Some methods for classification and analysis of multivariate observations[C]//Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967:281-297. [53] MONIKA S, ANAND C, GNANAMURTHY R K. Analyzing the User Profile Linkage across Different Social Network Platforms[J]. International Journal of Computer Science and Engineering, 2016, 4(2):1378-1383. [54] CEN L, DRAGUT E C, SI L, et al. Author disambiguation by hierarchical agglomerative clustering with adaptive stopping criterion[C]//Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013:741-744. [55] SONG Y, HUANG J, COUNCILL I G, et al. Efficient topic-based unsupervised name disambiguation[C]//Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, 2007:342-351. [56] QIAN Y, HU Y, CUI J, et al. Combining machine learning and human judgment in author disambiguation[C]//Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2011:1241-1246. [57] FELLEGI I P, SUNTER A B. A theory for record linkage[J]. Journal of the American Statistical Association, 1969, 64(328):1183-1210. [58] WINKLER W E. Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage[C]//Proceedings of the Section on Survey Research Methods, American Statistical Association. 1988:667-671. [59] BOYD S, VANDENBERGHE L. Convex Optimization[M]. Cambridge:Cambridge University Press, 2004. [60] GAO M, LIM E P, LO D, et al. CNL:Collective network linkage across heterogeneous social platforms[C]//IEEE International Conference on Data Mining. IEEE Computer Society, 2015:757-762. [61] YAKOUT M, ELMAGARMID A K, Elmeleegy H, et al. Behavior based record linkage[J]. Proceedings of the VLDB Endowment, 2010, 3(1/2):439-448. [62] FANG Y, CHANG M W. Entity linking on microblogs with spatial and temporal signals[J]. Transactions of the Association for Computational Linguistics, 2014, 2(1):259-272. [63] HAN H, XU W, ZHA H, et al. A hierarchical naive Bayes mixture model for name disambiguation in author citations[C]//Proceedings of the 2005 ACM Symposium on Applied Computing. ACM, 2005:1065-1069. [64] SINGLA P, DOMINGOS P. Collective object identification[C]//International Joint Conference on Artificial Intelligence.[S.l.]. Morgan Kaufmann Publishers Inc, 2005:1636-1637. [65] BALAJI J, MIN C, JAVED F, et al. Avatar:Large scale entity resolution of heterogeneous user profiles[C]//Proceedings of the 2nd Workshop on Data Management for End-To-End Machine Learning. ACM, 2018:Article No 2. [66] BLEI D M, NG A Y, JORDAN M I. Latent dirichlet allocation[J]. Journal of machine Learning Research, 2003(3):993-1022. [67] ZHAO W X, JIANG J, WENG J, et al. Comparing twitter and traditional media using topic models[C]//European Conference on Information Retrieval. Berlin:Springer, 2011:338-349. [68] YANG Y, SUN Y, TANG J, et al. Entity matching across heterogeneous sources[C]//Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015:1395-1404. [69] BHATTACHARYA I, GETOOR L. A latent dirichlet model for unsupervised entity resolution[C]//Proceedings of the 2006 SIAM International Conference on Data Mining,[S.l.]:Society for Industrial and Applied Mathematics, 2006:47-58. [70] SEN P. Collective context-aware topic models for entity disambiguation[C]//Proceedings of the 21st International Conference on World Wide Web. ACM, 2012:729-738. [71] TANG J, FANG Z, SUN J. Incorporating social context and domain knowledge for entity recognition[C]//Proceedings of the 24th International Conference on World Wide Web.[S.l.]:International World Wide Web Conferences Steering Committee, 2015:517-526. [72] STEORTS R C, HALL R, FIENBERG S E. A bayesian approach to graphical record linkage and deduplication[J]. Journal of the American Statistical Association, 2016, 516(111):1660-1672. [73] MAURYA A, TELANG R. Bayesian multi-view models for member-job matching and personalized skill recommendations[C]//Big Data, 2017 IEEE International Conference on. IEEE, 2017:1193-1202. |
[1] | 付瑞, 李剑宇, 王笳辉, 岳昆, 胡矿. 面向领域知识图谱的实体关系联合抽取[J]. 华东师范大学学报(自然科学版), 2021, 2021(5): 24-36. |
[2] | 郑思达, 刘岩, 杨晓坤, 戚成飞, 袁培森. 基于自适应竞争的均衡优化电力系统客户分类[J]. 华东师范大学学报(自然科学版), 2021, 2021(5): 146-156. |
[3] | 赵红成, 田秀霞, 杨泽森, 白万荣. YOLO-S: 一种新型轻量的安全帽佩戴检测模型[J]. 华东师范大学学报(自然科学版), 2021, 2021(5): 134-145. |
[4] | 杨梦晨, 陈旭栋, 蔡鹏, 倪葎. 早期时间序列分类方法研究综述[J]. 华东师范大学学报(自然科学版), 2021, 2021(5): 115-133. |
[5] | 马晓琴, 薛晓慧, 罗红郊, 刘通宇, 袁培森. 基于t-LeNet与时间序列分类的窃电行为检测[J]. 华东师范大学学报(自然科学版), 2021, 2021(5): 104-114. |
[6] | 刘天弼, 冯瑞. 一种基于先验标记特征的精准图像配准算法[J]. 华东师范大学学报(自然科学版), 2021, 2021(3): 65-77. |
[7] | 祝鸣, 沈建华, 汪家财. 可扩展物联网教学开发系统的设计与实现[J]. 华东师范大学学报(自然科学版), 2021, 2021(3): 78-95. |
[8] | 梁艳春, 房爱莲. 基于多通道卷积神经网络的中文文本关系抽取[J]. 华东师范大学学报(自然科学版), 2021, 2021(3): 96-104. |
[9] | 窦建凯, 林欣, 胡文心. 单图中的近似频繁子图挖掘算法[J]. 华东师范大学学报(自然科学版), 2019, 2019(6): 73-87. |
[10] | 陈远哲, 匡俊, 刘婷婷, 高明, 周傲英. 共指消解技术综述[J]. 华东师范大学学报(自然科学版), 2019, 2019(5): 16-35. |
[11] | 杨康, 黄定江, 高明. 面向自动问答的机器阅读理解综述[J]. 华东师范大学学报(自然科学版), 2019, 2019(5): 36-52. |
[12] | 邵明锐, 马登豪, 陈跃国, 覃雄派, 杜小勇. 基于社区问答数据迁移学习的FAQ问答模型研究[J]. 华东师范大学学报(自然科学版), 2019, 2019(5): 74-84. |
[13] | 陈亮, 郭佳雯, 武建功, 王占全, 史令. 基于法计算学理论的人工智能辅助决策算法研究[J]. 华东师范大学学报(自然科学版), 2019, 2019(5): 85-99. |
[14] | 傅裕, 李优, 林煜明, 周娅. 基于自注意力机制的冗长商品名称精简方法[J]. 华东师范大学学报(自然科学版), 2019, 2019(5): 113-122,167. |
[15] | 黄福兴, 周广山, 丁宏, 张罗平, 钱淑韵, 袁培森. 基于孤立森林算法的电能量异常数据检测[J]. 华东师范大学学报(自然科学版), 2019, 2019(5): 123-132. |
阅读次数 | ||||||
全文 |
|
|||||
摘要 |
|
|||||