Sentence classification algorithm based on multi-kernel support vector machine

doi:10.3969/j.issn.1000-5641.2023.00.008

Abstract

Mainstream sentence classification algorithms rely on a single word vector model to obtain the text representation, which limits how well the text is mapped into the feature space. To address this, multi-kernel learning is used to fuse text representations built from several word vector models and so improve classification accuracy. When different kernel functions are fused, conventional methods for optimizing the kernel function coefficients suffer from long training times and difficulty in finding a local optimum. A new kernel coefficient optimization method is therefore proposed, which approaches the optimal coefficient values step by step through parameter space segmentation and breadth-first search. With a support vector machine (SVM) as the classifier, classification experiments were carried out on seven text datasets. The results show that multi-kernel learning classifies significantly better than single-kernel learning, and that the proposed optimization method achieves good classification results with fewer training runs than conventional methods.

Key words: natural language processing, sentence classification, multi-kernel learning, support vector machine (SVM), mixed kernel
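The abstract names two ingredients, a fused kernel and a coefficient search based on parameter space segmentation with breadth-first search, but gives no algorithmic details. The sketch below is only one plausible reading, not the authors' implementation. It assumes the fused kernel is a weighted sum of Gram matrices and that the search splits the coefficient cube [0, 1]^dim into sub-cells level by level, keeping the best few cells at each level. The names `combine_kernels`, `space_segmentation_search`, `toy_score` and the parameters `depth`, `splits` and `keep` are all hypothetical. In practice `score` would be something like the validation accuracy of an SVM trained on the combined Gram matrix; here a toy function stands in for it.

```python
from itertools import product


def combine_kernels(kernels, weights):
    """Return the weighted sum of same-sized Gram matrices (lists of lists)."""
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


def space_segmentation_search(score, dim, depth=4, splits=2, keep=2):
    """Coarse-to-fine search for kernel coefficients in [0, 1]^dim.

    Assumed scheme (not taken from the paper): every cell in the current
    frontier is split into splits**dim sub-cells, each sub-cell is scored at
    its centre, and the `keep` best sub-cells form the next frontier. Cells
    are expanded level by level, i.e. in breadth-first order.
    """
    frontier = [([0.0] * dim, [1.0] * dim)]  # (lower corner, upper corner)
    best_w, best_s, evaluations = None, float("-inf"), 0
    for _ in range(depth):
        children = []
        for lo, hi in frontier:
            step = [(h - l) / splits for l, h in zip(lo, hi)]
            for idx in product(range(splits), repeat=dim):
                c_lo = [l + i * s for l, i, s in zip(lo, idx, step)]
                c_hi = [l + s for l, s in zip(c_lo, step)]
                centre = [(a + b) / 2 for a, b in zip(c_lo, c_hi)]
                children.append((score(centre), c_lo, c_hi, centre))
                evaluations += 1
        # Rank sub-cells by score and remember the best centre seen so far.
        children.sort(key=lambda c: c[0], reverse=True)
        if children[0][0] > best_s:
            best_s, best_w = children[0][0], children[0][3]
        frontier = [(c[1], c[2]) for c in children[:keep]]
    return best_w, best_s, evaluations


def toy_score(w):
    """Stand-in for validation accuracy; peaks at coefficients (0.3, 0.7)."""
    return -((w[0] - 0.3) ** 2 + (w[1] - 0.7) ** 2)


best_w, best_s, evaluations = space_segmentation_search(toy_score, dim=2)
print(best_w, evaluations)
```

Each call to `score` corresponds to one training run, so the evaluation count is the quantity to compare against a full grid at the same final resolution. The sketch does not normalize the coefficients to sum to one; whether the paper does so is not stated in the abstract.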

CLC number: TP391.1

Kaiyan XIAO (肖开研), Jie LIAN (廉洁). Sentence classification algorithm based on multi-kernel support vector machine[J]. Journal of East China Normal University (Natural Science), 2023, 2023(6): 85-94.

Figures/Tables (8): Figures 1-5, Tables 1-3


Dataset	Classes	Dataset size	Vocabulary size (words)	Validation set size	Average sentence length (words)
SST-1	5	11855	17836	2210	18
SST-2	2	9613	16185	1821	19
Subj	2	10000	21323	CV	23
TREC	6	5952	9692	500	10
CR	2	3775	5340	CV	19
MPQA	2	10606	6246	CV	3
CT	3	44955	62813	3798	24

Accuracy (%)
Model	CR	MPQA	Subj	SST-1	SST-2	TREC	CT
BOW	78.2	85.4	90.1	37.2	79.7	87.4	76.2
TF-IDF	78.7	85.7	90.9	24.4	71.0	83.9	74.4
Word2Vec	78.3	88.8	91.7	43.5	82.2	87.4	77.3
GloVe	79.4	88.1	92.5	42.2	83.0	87.2	78.6
FastText	80.4	88.8	92.9	44.7	83.6	88.4	79.5
BERT	83.8	89.4	96.2	45.5	84.7	88.0	81.9
MKL(GS)	84.8	91.4	97.4	46.4	85.8	89.4	82.1
MKL(RS)	85.2	90.9	96.8	46.0	85.3	89.9	82.6
MKL*	85.7	91.7	97.3	46.4	86.3	91.1	83.4
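To make the multi-kernel versus single-kernel comparison in the accuracy table concrete, the short script below computes, for each dataset, which single-representation model scores highest and how many percentage points MKL* gains over it. The numbers are copied from the table; the variable names are illustrative only, and treating the six non-MKL rows as the "single-kernel" baselines is an assumption based on the abstract.

```python
datasets = ["CR", "MPQA", "Subj", "SST-1", "SST-2", "TREC", "CT"]

# Accuracy (%) of the single-representation models, copied from the table.
single = {
    "BOW":      [78.2, 85.4, 90.1, 37.2, 79.7, 87.4, 76.2],
    "TF-IDF":   [78.7, 85.7, 90.9, 24.4, 71.0, 83.9, 74.4],
    "Word2Vec": [78.3, 88.8, 91.7, 43.5, 82.2, 87.4, 77.3],
    "GloVe":    [79.4, 88.1, 92.5, 42.2, 83.0, 87.2, 78.6],
    "FastText": [80.4, 88.8, 92.9, 44.7, 83.6, 88.4, 79.5],
    "BERT":     [83.8, 89.4, 96.2, 45.5, 84.7, 88.0, 81.9],
}
mkl_star = [85.7, 91.7, 97.3, 46.4, 86.3, 91.1, 83.4]

gains = {}
for col, name in enumerate(datasets):
    # Best single model on this dataset, then the MKL* margin over it.
    best_model = max(single, key=lambda m: single[m][col])
    gains[name] = (best_model, round(mkl_star[col] - single[best_model][col], 1))

for name, (model, gain) in gains.items():
    print(f"{name}: best single model {model}, MKL* gain {gain:+.1f} points")
```

BERT is the strongest single model on every dataset except TREC, where FastText is slightly ahead; MKL* improves on the best single model in all seven cases.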

Optimization method	Training runs	CR (s)	MPQA (s)	Subj (s)	SST-1 (s)	SST-2 (s)	TREC (s)	CT (s)
GS	1680	3869.5	12709.9	9068.1	3158.4	1399.4	882.0	13726.2
RS	1680	3828.4	13244.1	8908.1	3343.2	1370.8	900.5	14465.0
Space segmentation	1280	3050.8	8571.1	6755.2	2406.4	1054.7	759.1	10718.6
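The training-cost figures in the table above can be turned into relative savings with a few lines of arithmetic. The script below compares the space-segmentation row against the GS row, using only numbers copied from the table; the variable names are illustrative.

```python
datasets = ["CR", "MPQA", "Subj", "SST-1", "SST-2", "TREC", "CT"]

gs_runs, seg_runs = 1680, 1280
gs_time = [3869.5, 12709.9, 9068.1, 3158.4, 1399.4, 882.0, 13726.2]   # seconds
seg_time = [3050.8, 8571.1, 6755.2, 2406.4, 1054.7, 759.1, 10718.6]  # seconds

# Percentage reduction in the number of training runs.
run_saving = round(100 * (1 - seg_runs / gs_runs), 1)

# Percentage reduction in training time per dataset.
time_saving = {
    name: round(100 * (1 - s / g), 1)
    for name, g, s in zip(datasets, gs_time, seg_time)
}

print(f"training runs: {run_saving}% fewer")
for name, saving in time_saving.items():
    print(f"{name}: {saving}% less training time")
```

On these figures the run count drops by roughly 24%, and the training time falls on every dataset, with the smallest relative saving on TREC.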
