
Text matching based on multi-dimensional feature representation

  • Ming WANG,
  • Te LI,
  • Dingjiang HUANG
  • School of Data Science and Engineering, East China Normal University, Shanghai 200062, China

Received date: 2022-07-20

Accepted date: 2022-07-20

Online published: 2022-09-26

Funding

National Natural Science Foundation of China (U1711262, 62072185, U1811264)



Cite this article

WANG Ming, LI Te, HUANG Dingjiang. Text matching based on multi-dimensional feature representation [J]. Journal of East China Normal University (Natural Science), 2022, 2022(5): 126-135. DOI: 10.3969/j.issn.1000-5641.2022.05.011

Abstract

Text semantic matching underlies many natural language processing tasks and is required in many scenarios, such as search and question-answering systems. In practical applications, the efficiency of text semantic matching is crucial. Although representation-based semantic-matching models are less accurate than interaction-based models, they are far more efficient, and the key to improving their performance is extracting sentence vectors that carry high-level semantic features. To address this, this paper designs a feature-fusion module and a feature-extraction module on top of the ERNIE model to obtain sentence vectors with multi-dimensional semantic features, and further designs a semantic-prediction loss function to strengthen the model's ability to capture semantic information and thereby improve matching accuracy. The proposed model achieves an accuracy of 85.1% on the Baidu Qianyan text similarity dataset, demonstrating good performance.
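The paper itself contains no code; the Python sketch below only illustrates the general representation-based (bi-encoder) setup described in the abstract, under stated assumptions. The checkpoint name, the averaging of the last four hidden layers as a crude stand-in for feature fusion, and the mean pooling are illustrative choices of ours; the paper's actual feature-fusion module, feature-extraction module, and semantic-prediction loss are not reproduced here.

# Minimal bi-encoder sketch (not the authors' implementation): each sentence is
# encoded independently, several transformer layers are averaged as a rough
# stand-in for the paper's feature-fusion idea, and the pooled sentence vectors
# are compared by cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "nghuyong/ernie-3.0-base-zh"  # assumed ERNIE checkpoint, for illustration only
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()


def encode(sentence: str) -> torch.Tensor:
    """Encode one sentence into a single vector (mean pooling over fused layers)."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = encoder(**inputs, output_hidden_states=True)
    # Average the last four hidden layers: different layers tend to capture
    # different levels of semantic information.
    fused = torch.stack(outputs.hidden_states[-4:], dim=0).mean(dim=0)  # (1, seq, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1).float()               # (1, seq, 1)
    return (fused * mask).sum(dim=1) / mask.sum(dim=1)                  # (1, hidden)


def match_score(s1: str, s2: str) -> float:
    """Cosine similarity between two independently encoded sentences."""
    return torch.nn.functional.cosine_similarity(encode(s1), encode(s2)).item()


if __name__ == "__main__":
    print(match_score("怎么查询快递单号", "如何查快递的单号"))

Because the two sentences are encoded independently in this arrangement, their vectors can be precomputed and indexed offline, which is the efficiency advantage over interaction-based models that the abstract refers to.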

References

1 YANG Y, ZHANG C. Attention-based multi-level network for text matching with feature fusion [C]// 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence. 2021: 1-7.
2 RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training [EB/OL]. [2022-07-08]. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-supervised/language_understanding_paper.pdf.
3 DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [EB/OL]. (2019-05-24) [2022-07-08]. https://arxiv.org/pdf/1810.04805.pdf.
4 REIMERS N, GUREVYCH I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks [EB/OL]. (2019-08-27) [2022-07-08]. https://arxiv.org/pdf/1908.10084.pdf.
5 SALTON G, WONG A, YANG C S. A vector space model for automatic indexing [J]. Communications of the ACM, 1975, 18(11): 613-620.
6 MILLER G A. WordNet: A lexical database for English [C]// Proceedings of the Workshop on Human Language Technology. 1994: 468.
7 KOLTE S G, BHIRUD S G. Word sense disambiguation using WordNet domains [C]// First International Conference on Emerging Trends in Engineering and Technology. New York: IEEE, 2008: 1187-1191.
8 MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space [EB/OL]. (2013-09-07) [2022-07-08]. https://arxiv.org/pdf/1301.3781.pdf.
9 LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition [J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
10 HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780.
11 CHUNG J, GULCEHRE C, CHO K H, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling [EB/OL]. (2014-12-11) [2022-07-08]. https://arxiv.org/pdf/1412.3555.pdf.
12 MUELLER J, THYAGARAJAN A. Siamese recurrent architectures for learning sentence similarity [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2016, 30(1): 2786-2792.
13 CHEN Q, ZHU X, LING Z, et al. Enhanced LSTM for natural language inference [EB/OL]. (2017-04-26) [2022-07-08]. https://arxiv.org/pdf/1609.06038.pdf.
14 WANG Z, HAMZA W, FLORIAN R. Bilateral multi-perspective matching for natural language sentences [EB/OL]. (2017-07-14) [2022-07-08]. https://arxiv.org/pdf/1702.03814.pdf.
15 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [EB/OL]. (2017-12-06) [2022-07-08]. https://arxiv.org/pdf/1706.03762.pdf.
16 SUN Y, WANG S, LI Y, et al. ERNIE: Enhanced representation through knowledge integration [EB/OL]. (2019-04-19) [2022-07-08]. https://arxiv.org/pdf/1904.09223.pdf.
17 JAWAHAR G, SAGOT B, SEDDAH D. What does BERT learn about the structure of language? [EB/OL]. (2019-06-04) [2022-07-08]. https://hal.inria.fr/hal-02131630/document.
18 GRILL J B, STRUB F, ALTCHÉ F, et al. Bootstrap your own latent: A new approach to self-supervised learning [J]. Advances in Neural Information Processing Systems, 2020, 33: 21271-21284.
19 LIU X, CHEN Q, DENG C, et al. LCQMC: A large-scale Chinese question matching corpus [C]// Proceedings of the 27th International Conference on Computational Linguistics. 2018: 1952-1962.
20 CHEN J, CHEN Q, LIU X, et al. The BQ corpus: A large-scale domain-specific Chinese corpus for sentence semantic equivalence identification [C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018: 4946-4951.
21 YANG Y, ZHANG Y, TAR C, et al. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification [EB/OL]. (2019-08-30) [2022-07-08]. https://arxiv.org/pdf/1908.11828.pdf.
22 WEI J, REN X, LI X, et al. NEZHA: Neural contextualized representation for Chinese language understanding [EB/OL]. (2019-09-05) [2022-07-08]. https://arxiv.org/pdf/1909.00204v2.pdf.