华东师范大学学报(自然科学版) ›› 2020, Vol. 2020 ›› Issue (5): 95-112.doi: 10.3969/j.issn.1000-5641.202091011
韩程程, 李磊, 刘婷婷, 高明
收稿日期:
2020-08-09
发布日期:
2020-09-24
通讯作者:
高明,男,教授,博士生导师,研究方向为教育计算、知识图谱、知识工程、用户画像、社会网络挖掘、不确定数据管理.E-mail:mgao@dase.ecnu.edu.cn
E-mail:mgao@dase.ecnu.edu.cn
基金资助:
HAN Chengcheng, LI Lei, LIU Tingting, GAO Ming
Received:
2020-08-09
Published:
2020-09-24
摘要: 综述了语义文本相似度计算的最新研究进展, 主要包括基于字符串、基于统计、基于知识库和基于深度学习的方法. 针对每一类方法, 不仅介绍了其中典型的模型和方法, 而且深入探讨了各类方法的优缺点; 并对该领域的常用公开数据集和评估指标进行了整理, 最后讨论并总结了该领域未来可能的研究方向.
中图分类号:
韩程程, 李磊, 刘婷婷, 高明. 语义文本相似度计算方法[J]. 华东师范大学学报(自然科学版), 2020, 2020(5): 95-112.
HAN Chengcheng, LI Lei, LIU Tingting, GAO Ming. Approaches for semantic textual similarity[J]. Journal of East China Normal University(Natural Science), 2020, 2020(5): 95-112.
[1] BLOEHDORN S, BASILI R, CAMMISA M, et al. Semantic kernels for text classification based on topological measures of feature similarity [C]//Proceeding of the Sixth International Conference on Data Mining (ICDM’06). 2006: 808-812. [2] TONG Y, GU L. A news text clustering method based on similarity of text labels [C]//International Conference on Advanced Hybrid Information Processing. 2018: 496-503. [3] ATTARDI G, SIMI M, DEI R S. TANL-1: Coreference resolution by parse analysis and similarity clustering [C]//Proceedings of the 5th International Workshop on Semantic Evaluation. 2010: 108-111. [4] DAS A, MANDAL J, DANIAL Z, et al. A novel approach for automatic bengali question answering system using semantic similarity analysis[EB/OL]. (2019-10-23)[2020-07-01]. https://arxiv.org/ftp/arxiv/papers/1910/1910.10758.pdf. [5] AMIR S, TANASESCU A, ZIGHED D A. Sentence similarity based on semantic kernels for intelligent text retrieval [J]. Journal of Intelligent Information Systems, 2017, 48(3): 675-689. [6] SOORI H, PRILEPOK M, PLATOS J, et al. Semantic and similarity measure methods for plagiarism detection of students’ assignments [C]//Proceedings of the Second International Afro-European Conference for Industrial Advancement AECIA 2015. 2016: 117-125. [7] VADAPALLI R, KURISINKEL L J, GUPTA M, et al. SSAS: Semantic similarity for abstractive summarization [C]//Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2017: 198-203. [8] QIAN M, LIU J, LI C, et al. A comparative study of English-Chinese translations of court texts by machine and human translators and the Word2Vec based similarity measure’s ability to gauge human evaluation biases [C]//Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks. 2019: 95-100. [9] MAJUMDER G, PAKRAY P, GELBUKH A, et al. Semantic textual similarity methods, tools, and applications: A survey [J]. Computación y Sistemas, 2016, 20(4): 647-665. [10] 王春柳, 杨永辉, 邓霏, 等. 文本相似度计算方法研究综述 [J]. 情报科学, 2019, 37(3): 158-168 [11] RISTAD, ERIC S, YIANILOS, et al. Learning string-edit distance[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998, 20(5): 522-532. [12] XU X, CHEN L, HE P. Fast sequence similarity computing with LCS on LARPBS [C]//International Symposium on Parallel and Distributed Processing and Applications. 2005: 168-175. [13] KONDRAK G. N-gram similarity and distance [C]// String Processing and Information Retrieval. 2005: 115-126. [14] NIWATTANAKUL S, SINGTHONGCHAI J, NAENUDORN E, et al. Using of Jaccard Coefficient for Keywords Similarity [J]. Lecture Notes in Engineering and Computer Science, 2013, 1(3): 13-15. [15] 车万翔, 刘挺, 秦兵, 等. 基于改进编辑距离的中文相似句子检索 [J]. 高技术通讯, 2004, 14(7): 15-19 [16] SLANEY M, CASEY M. Locality-sensitive hashing for finding nearest neighbors [J]. IEEE Signal processing magazine, 2008, 25(2): 128-131. [17] SALTON G, WONG A, YANG C S, et al. A vector space model for automatic indexing [J]. Communications of The ACM, 1975, 18(11): 613-620. [18] LANDAUER T K, DUMAIS S T. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge [J]. Psychological Review, 1997, 104(2): 211-240. [19] HOFMANN T. Probabilistic latent semantic analysis [J]. Uncertainty in Artificial Intelligence, 1999, 15(6): 289-296. [20] BLEI D M, NG A Y, JORDAN M I, et al. Latent dirichlet allocation [J]. Journal of Machine Learning Research, 2012(3): 993-1022. [21] GUO Q L, LI Y M, TANG Q. Similarity computing of documents based on VSM [J]. Application Research of Computers, 2008, 25(11): 3256-3258. [22] LI L. Research and implementation of an improved VSM-based text similarity algorithm [J]. Computer Applications and Software, 2012, 29(2): 282-284. [23] TASI C, HUANG Y, LIU C, et al. Applying VSM and LCS to develop an integrated text retrieval mechanism [J]. Expert Systems With Applications, 2012, 39(4): 3974-3982. [24] 王振振, 何明, 杜永萍. 基于LDA主题模型的文本相似度计算 [J]. 计算机科学, 2013, 40(12): 229-232 [25] XIONG D P, WANG J, LIN H F. An LDA-based approach to finding similar questions for community question answer [J]. Journal of Chinese Information Processing, 2012, 26(5): 40-45. [26] ZHANG C, CHEN L, LI X, et al. Chinese text similarity algorithm based on PST_LDA [J]. Application Research of Computers, 2016, 33(2): 375-377. [27] MIAO Y, YU L, BLUNSOM P, et al. Neural variational inference for text processing [EB/OL]. (2016-01-04)[2020-07-01]. https://arxiv.org/pdf/1511.06038.pdf. [28] LAU J H, BALDWIN T, COHN T, et al. Topically Driven Neural Language Model [C]// Meeting of the Association for Computational Linguistics. 2017: 355-365. [29] MILLER, GEORGE A. WordNet: A lexical database for English [J]. Communications of the Acm, 1995, 38(11): 39-41. [30] 梅家驹, 竺一鸣, 高蕴琦, 等. 同义词词林 [M]. 上海: 上海辞书出版社, 1983. [31] 董振东. 语义关系的表达和知识系统的建造 [J]. 语言文字应用, 1998(3): 76-82 [32] RADA R, MILI H, BICKNELL E J, et al. Development and application of a metric on semantic nets [J]. IEEE Transaction on System Man & Cybernetics, 1989, 19(1):17-30. [33] RICHARDSON R, SMEATON A F. Using WordNet in a knowledge-based approach to information retrieval [EB/OL]. (1995-02-01)[2020-07-01].http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=0DDA60E11D37A7DA2777BF162C86760F?doi=10.1.1.48.9324&rep=rep1&type=pdf. [34] LEACOCK C, CHODOROW M. Combining local context and WordNet similarity for word sense identification [M]// FELLBAUM C. WordNet: An Electronic Lexical Database. Massachusetts: MIT Press, 1998. [35] WU Z B. Verb semantics and lexical selection[C]// Acl Proceedings of Annual Meeting on Association for Computational Linguistics. 1994: 133-138. [36] HIRST G, STONGE D. Lexical chains as representations of context for the detection and correction of malapropisms[M]// FELLBAUM C. WordNet: An Electronic Lexical Database. Massachusetts: MIT Press, 1998, 305: 305-332. [37] YANG D, POWERS D M W. Measuring semantic similarity in the taxonomy of WordNet [C]// ACSC’05: Proceedings of the Twenty-eighth Australasian conference on Computer Science. 2005, 38: 315-322. [38] RESNIK P. Using information content to evaluate semantic similarity in a taxonomy [C]// IJCAI’95: Proceedings of the 14th International Joint Conference on Artificial Intelligence. 1995(1): 448-453. [39] JIANG J J, CONRATH D W. Semantic similarity based on corpus statistics and lexical taxonomy [EB/OL]. (1997-10-01)[2020-07-01]. https://arxiv.org/pdf/cmp-lg/9709008.pdf. [40] LIN D. An information-theoretic definition of similarity [C]//ICML’98: Proceedings of the Fifteenth International Conference on Machine Learning. 1998(7): 296-304. [41] LESK M. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone [C]//Proceedings of the 5th Annual International Conference on Systems Documentation. 1986: 24-26. [42] BANERJEE S, PEDERSEN T. An adapted lesk algorithm for word sense disambiguation using WordNet [C]//International Conference on Intelligent Text Processing and Computational Linguistics. 2002: 136-145. [43] PEDERSEN T, PATWARDHAN S, MICHELIZZI J. WordNet : Similarity-Measuring the relatedness of concepts [C]//Demonstrations’04: Demonstration Papers at HLT-NAACL 2004. 2004(5): 38-41. [44] LI Y, BANDAR Z A, MCLEAN D. An approach for measuring semantic similarity between words using multiple information sources [J]. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(4): 871-882. [45] SHI B, FANG L Y, YAN J Z, et al. Ontology-based measure of semantic similarity between concepts [C]//WCSE '09: Proceedings of the 2009 WRI World Congress on Software Engineering. 2009(2): 109-112. [46] 郑志蕴, 阮春阳, 李伦, 等. 本体语义相似度自适应综合加权算法研究 [J]. 计算机科学, 2016, 43: 242-247 [47] 刘群, 李素建. 基于《知网》 的词汇语义相似度计算 [J]. 中文计算语言学, 2002, 7(2): 59-76. [48] 李峰, 李芳. 中文词语语义相似度计算——基于《知网》2000 [J]. 中文信息学报, 2007, 21(3): 99-105 [49] 江敏, 肖诗斌, 王弘蔚, 等. 一种改进的基于《知网》的词语语义相似度计算 [J]. 中文信息学报, 2008, 22(5): 84-89 [50] STRUBE M, PONZETTO S P. WikiRelate! Computing semantic relatedness using Wikipedia [C]//AAAI'06: Proceedings of the 21st National Conference on Artificial Intelligence. 2006(2): 1419-1424. [51] GABRILOVICH E, MARKOVITCH S. Computing semantic relatedness using wikipedia-based explicit semantic analysis [C]//IJCAI’07: Proceedings of the 20th International Joint Conference on Artifical Intelligence. 2007(1): 1606-1611. [52] WITTEN I, MILNE D N. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links [C]//Proceedings of AAAI’2008. 2008: 25-30. [53] YEH E, RAMAGE D, MANNING C D, et al. WikiWalk: Random walks on Wikipedia for semantic relatedness [C]//Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing. 2009: 41-49. [54] CAMACHO-COLLADOS J, PILEHVAR M T, NAVIGLI R. Nasari: A novel approach to a semantically-aware representation of items [C]//Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2015: 567-577. [55] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space [EB/OL]. (2013-09-07)[2020-07-01]. https://arxiv.org/pdf/1301.3781.pdf. [56] PENNINGTON J, SOCHER R, MANNING C D. Glove: Global vectors for word representation [C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014: 1532-1543. [57] JOULIN A, GRAVE E, BOJANOWSKI P, et al. Bag of tricks for efficient text classification [EB/OL]. (2016-08-09)[2020-07-01]. https://arxiv.org/pdf/1607.01759.pdf. [58] PETERS M E, NEUMANN M, IYYER M, et al. Deep contextualized word representations [EB/OL]. (2018-03-22)[2020-07-01]. https://arxiv.org/pdf/1802.05365.pdf. [59] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training [EB/OL]. (2018-11-05)[2020-07-01]. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language understanding paper.pdf. [60] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]//NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 5998-6008. [61] DEVLIN J, CHANG M W, LEE K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding [EB/OL]. (2019-05-24)[2020-07-01]. https://arxiv.org/pdf/1810.04805.pdf. [62] LE Q, MIKOLOV T. Distributed representations of sentences and documents [C]//ICML’14: Proceedings of the 31st International Conference on International Conference on Machine Learning. 2014, 32: 1188-1196. [63] PAGLIARDINI M, GUPTA P, JAGGI M. Unsupervised learning of sentence embeddings using compositional n-gram features [EB/OL]. (2018-12-28)[2020-07-01]. https://arxiv.org/pdf/1703.02507.pdf. [64] KIROS R, ZHU Y, SALAKHUTDINOV R R, et al. Skip-thought vectors [C]//Advances in neural information processing systems. 2015: 3294-3302. [65] LOGESWARAN L, LEE H. An efficient framework for learning sentence representations [EB/OL]. (2018-03-07)[2020-07-01]. https://arxiv.org/pdf/1803.02893.pdf. [66] HILL F, CHO K, KORHONEN A. Learning distributed representations of sentences from unlabelled data [EB/OL]. (2016-02-10)[2020-07-01]. https://arxiv.org/pdf/1602.03483.pdf. [67] KUSNER M, SUN Y, KOLKIN N, et al. From word embeddings to document distances [C]//International Conference on Machine Learning. 2015: 957-966. [68] ARORA S, LIANG Y, MA T. A simple but tough-to-beat baseline for sentence embeddings [EB/OL]. (2017-02-04)[2020-07-01]. https://openreview.net/pdf?id=SyK00v5xx. [69] RÜCKLÉ A, EGER S, PEYRARD M, et al. Concatenated power mean word embeddings as universal cross-lingual sentence representations [EB/OL]. (2018-09-12)[2020-07-01]. https://arxiv.org/pdf/1803.01400.pdf. [70] HUANG P S, HE X, GAO J, et al. Learning deep structured semantic models for web search using clickthrough data [C]//Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 2013: 2333-2338. [71] SHEN Y, HE X, GAO J, et al. A latent semantic model with convolutional-pooling structure for information retrieval[C]//Proceedings of the 23rd ACM international conference on conference on information and knowledge management. 2014: 101-110. [72] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. Imagenet classification with deep convolutional neural networks [C]//Advances in nNeural Information Processing Systems. 2012: 1097-1105. [73] PALANGI H, DENG L, SHEN Y, et al. Semantic modelling with long-short-term memory for information retrieval [EB/OL]. (2015-02-27)[2020-07-01]. https://arxiv.org/pdf/1412.6629.pdf. [74] GERS F. Long short-term memory in recurrent neural networks [D]. Lausanne: EPFL, 2001. [75] PONTES E L, HUET S, LINHARES A C, et al. Predicting the semantic textual similarity with siamese CNN and LSTM [EB/OL]. (2018-10-24)[2020-07-01]. https://arxiv.org/pdf/1810.10641.pdf. [76] MUELLER J, THYAGARAJAN A. Siamese recurrent architectures for learning sentence similarity [C]//AAAI’16: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. 2016(2): 2786-2792. [77] LIN Z, FENG M, SANTOS C N, et al. A structured self-attentive sentence embedding [EB/OL]. (2017-03-09)[2020-07-01]. https://arxiv.org/pdf/1703.03130.pdf. [78] CONNEAU A, KIELA D, SCHWENK H, et al. Supervised learning of universal sentence representations from natural language inference data [EB/OL]. (2017-07-21)[2020-07-01]. https://arxiv.org/pdf/1705.02364v4.pdf. [79] YIN W, SCHÜTZE H, XIANG B, et al. Abcnn: Attention-based convolutional neural network for modeling sentence pairs [J]. Transactions of the Association for Computational Linguistics, 2016(4): 259-272. [80] HE H, LIN J. Pairwise word interaction modeling with deep neural networks for semantic similarity measurement [C]//Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016: 937-948. [81] WANG Z, HAMZA W, FLORIAN R. Bilateral multi-perspective matching for natural language sentences [C]//Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence Main track. 2017: 4144-4150. [82] GONG Y, LUO H, ZHANG J. Natural language inference over interaction space [EB/OL]. (2018-05-26)[2020-07-01]. https://arxiv.org/pdf/1709.04348.pdf. [83] KIM S, KANG I, KWAK N. Semantic sentence matching with densely-connected recurrent and co-attentive information [C]//Proceedings of the AAAI conference on artificial intelligence. 2019, 33: 6586-6593. [84] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks [C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4700-4708. [85] YANG Y, YUAN S, CER D, et al. Learning semantic textual similarity from conversations [EB/OL]. (2018-04-20)[2020-07-01]. https://arxiv.org/pdf/1804.07754.pdf. [86] CER D, YANG Y, KONG S, et al. Universal sentence encoder [EB/OL]. (2018-04-12)[2020-07-01]. https://arxiv.org/pdf/1803.11175.pdf. [87] CHEN G, SHI X, CHEN M, et al. Text similarity semantic calculation based on deep reinforcement learning [J]. International Journal of Security and Networks, 2020, 15(1): 59-66. |
[1] | 徐秋荣, 朱鹏, 罗轶凤, 董启文. 金融领域中文命名实体识别研究进展[J]. 华东师范大学学报(自然科学版), 2021, 2021(5): 1-13. |
[2] | 刘波, 白晓东, 张更新, 沈俊, 谢继东, 赵来定, 洪涛. 深度学习在认知无线电中的应用研究综述[J]. 华东师范大学学报(自然科学版), 2021, 2021(1): 36-52. |
[3] | 张旭, 黄定江. 基于深度学习的铝材表面缺陷检测[J]. 华东师范大学学报(自然科学版), 2020, 2020(6): 105-114. |
[4] | 穆肇南, 刘梦珠, 孙界平, 王成. 基于演化算法的唐诗自动生成系统研究[J]. 华东师范大学学报(自然科学版), 2020, 2020(6): 129-139. |
[5] | 王嘉宁, 何怡, 朱仁煜, 刘婷婷, 高明. 基于远程监督的关系抽取技术[J]. 华东师范大学学报(自然科学版), 2020, 2020(5): 113-130. |
[6] | 郭晓哲, 彭敦陆, 张亚彤, 彭学桂. GRS: 一种面向电商领域智能客服的生成-检索式对话模型[J]. 华东师范大学学报(自然科学版), 2020, 2020(5): 156-166. |
[7] | 刘恒宇, 张天成, 武培文, 于戈. 知识追踪综述[J]. 华东师范大学学报(自然科学版), 2019, 2019(5): 1-15. |
[8] | 陈远哲, 匡俊, 刘婷婷, 高明, 周傲英. 共指消解技术综述[J]. 华东师范大学学报(自然科学版), 2019, 2019(5): 16-35. |
[9] | 杨康, 黄定江, 高明. 面向自动问答的机器阅读理解综述[J]. 华东师范大学学报(自然科学版), 2019, 2019(5): 36-52. |
[10] | 叶健, 赵慧. 基于大规模弹幕数据监听和情感分类的舆情分析模型[J]. 华东师范大学学报(自然科学版), 2019, 2019(3): 86-100. |
[11] | 余若男, 黄定江, 董启文. 基于深度学习的场景文字检测研究进展[J]. 华东师范大学学报(自然科学版), 2018, 2018(5): 1-16. |
[12] | 张衡, 陈良育. Levenshtein算法优化及在题库判重中的应用[J]. 华东师范大学学报(自然科学版), 2018, 2018(5): 154-163. |
[13] | 袁培森, 张勇, 李美玲, 顾兴健. 基于深度哈希学习的商标图像检索研究[J]. 华东师范大学学报(自然科学版), 2018, 2018(5): 172-182. |
阅读次数 | ||||||
全文 |
|
|||||
摘要 |
|
|||||