Approaches for semantic textual similarity

doi:10.3969/j.issn.1000-5641.202091011

Abstract

Abstract: This paper surveys recent research progress on semantic textual similarity computation, covering string-based, statistics-based, knowledge-based, and deep-learning-based methods. For each category, it introduces typical models and approaches and discusses their advantages and disadvantages. It also summarizes the public datasets and evaluation metrics commonly used in the field and, finally, discusses possible directions for future research.

Key words: textual similarity, semantic similarity, natural language processing, knowledge base, deep learning
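
To make the first two categories in the abstract concrete, the following is a minimal Python sketch, not taken from the paper. It assumes Jaccard overlap of character bigrams as a representative string-based measure and cosine similarity over raw term counts as a representative statistics-based measure. The function names, the bigram size, and the whitespace tokenization are illustrative choices.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> set:
    """Return the set of character n-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 2) -> float:
    """String-based (illustrative): overlap of character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two strings with no n-grams are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def cosine_similarity(a: str, b: str) -> float:
    """Statistics-based (illustrative): cosine of term-count vectors."""
    counts_a = Counter(a.lower().split())
    counts_b = Counter(b.lower().split())
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence shares nothing with any other
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    s1 = "a man is playing a guitar"
    s2 = "a person plays the guitar"
    print(jaccard_similarity(s1, s2))
    print(cosine_similarity(s1, s2))
```

Both measures compare surface forms only, so a pair such as "man" and "person" contributes nothing to the score. Gaps of this kind are what knowledge-based and deep-learning-based methods aim to close.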

CLC number: TP311

HAN Chengcheng, LI Lei, LIU Tingting, GAO Ming. Approaches for semantic textual similarity[J]. Journal of East China Normal University (Natural Science), 2020, 2020(5): 95-112.
