Research on software classification based on the fusion of code and descriptive text

doi:10.3969/j.issn.1000-5641.2025.01.004

Abstract

Third-party software systems play a significant role in modern software development. Developers build software to meet their requirements by retrieving appropriate dependency libraries from third-party software repositories, which avoids reinventing the wheel and speeds up development. However, retrieving third-party dependency libraries can be difficult. Repositories typically provide preset tags (categories) for developers to search by, but when a software system's preset tags are labeled incorrectly, developers cannot find the libraries they need, and this inevitably affects the development process. This study proposes a software classification model to address this challenge. The model combines method vectors, method importance, and text vectors to assign software of unknown category to a known category. Because no publicly available dataset exists for this problem, we built one and made it public; it contains 120 software systems in 30 categories from the Maven repository. The classification model was tested on this self-built dataset: prediction accuracy was 70% with one candidate (top-1) and reached 90% with three candidates (top-3). The experimental results show that the model can effectively classify software systems in open-source repositories and help developers quickly locate third-party libraries.

Keywords: software classification, third-party software system, method importance score, code2vec
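
The abstract names three ingredients (method vectors, method importance, and text vectors) but does not state how they are fused. The following is a minimal sketch only, under three assumptions that are not taken from the paper: method vectors are averaged with their importance scores as weights, the result is concatenated with the text vector after L2 normalization, and candidate categories are ranked by cosine similarity against one vector per category. All function names and the toy vectors are hypothetical.

```python
import math

def weighted_mean(vectors, weights):
    """Importance-weighted average of method vectors (assumed pooling step)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def fuse(method_vectors, importance, text_vector):
    """Assumed fusion: normalize the code part and the text part, then concatenate."""
    code_part = l2_normalize(weighted_mean(method_vectors, importance))
    return code_part + l2_normalize(text_vector)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_categories(software_vec, category_vecs, k=3):
    """Return the k category names most similar to the software vector."""
    scored = sorted(category_vecs.items(),
                    key=lambda item: cosine(software_vec, item[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Toy two-dimensional vectors, invented purely for illustration.
methods = [[1.0, 0.0], [0.0, 1.0]]
importance = [0.9, 0.1]          # the first method dominates the code part
text = [0.0, 2.0]
software = fuse(methods, importance, text)

categories = {
    "JSONLibs": fuse([[1.0, 0.0]], [1.0], [0.0, 1.0]),
    "MathLibs": fuse([[0.0, 1.0]], [1.0], [1.0, 0.0]),
}
top = rank_categories(software, categories, k=1)
print(top)
```

With `k=1` the ranking corresponds to the top-1 setting reported in the abstract, and with `k=3` to the top-3 setting.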

CLC number:

TP311.5

Yuhang CHEN, Shizhou WANG, Zhengting TANG, Liangyu CHEN, Ningkang JIANG. Research on software classification based on the fusion of code and descriptive text[J]. Journal of East China Normal University (Natural Science), 2025, 2025(1): 46-58.

References (21)

1	SHARMA A, THUNG F, KOCHHAR P S, et al. Cataloging GitHub repositories [C]// Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering. ACM, 2017: 314-319.
2	WANG T, WANG H M, YIN G, et al. Tag recommendation for open source software [J]. Frontiers of Computer Science, 2014, 8(1): 69-82.
3	WANG Y, LIU H X, GAO S Q, et al. Categorizing npm packages by analyzing the text information in software repositories [C]// Proceedings of the 28th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 2021: 53-60.
4	AL-KOFAHI J M, TAMRAWI A, NGUYEN T T, et al. Fuzzy set approach for automatic tagging in evolving software [C]// Proceedings of the 2010 IEEE International Conference on Software Maintenance. IEEE, 2010. DOI: 10.1109/ICSM.2010.5609751.
5	RADOSAVLJEVIC V, GRBOVIC M, DJURIC N, et al. Smartphone app categorization for interest targeting in advertising marketplace [C]// Proceedings of the 25th International Conference Companion on World Wide Web. Geneva: International World Wide Web Conferences Steering Committee, 2016: 93-94.
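6	YUSOF Y, ALHERSH T, MAHMUDDIN M, et al. Classification of machine learning engines using latent semantic indexing [C]// Knowledge Management International Conference (KMICe). Kedah Darul Aman, Malaysia: Universiti Utara Malaysia (UUM), 2012: 472-476.
7	ZHENG J, OU Y Y. Malicious code classification method based on convolutional neural network and multi-feature fusion [J]. Application Research of Computers, 2022, 39(1): 240-244. (in Chinese)
8	XUAN B N, LI J. Malware classification method based on improved CNN [J]. Acta Electronica Sinica, 2023, 51(5): 1187-1197. (in Chinese)
9	GU Y H, WANG Y F, LIU W X, et al. Malware similarity measurement method based on multiplex heterogeneous graph [J]. Journal of Software, 2023, 34(7): 3188-3205. (in Chinese)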
10	VARGAS-BALDRICH S, LINARES-VÁSQUEZ M, POSHYVANYK D. Automated tagging of software projects using bytecode and dependencies [C]// Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2015: 289-294.
11	YANG L, WANG L, HU Z G, et al. Automatic tagging for open source software by utilizing package dependency information [C]// Proceedings of the 2020 International Symposium on Theoretical Aspects of Software Engineering (TASE). IEEE, 2020: 137-144.
12	HAMEDANI M R, KIM G, CHO S. SimAndro: An effective method to compute similarity of Android applications [J]. Soft Computing, 2019, 23: 7569-7590.
13	LI M L, LU Q, LONG Y F. Representation learning of multiword expressions with compositionality constraint [C]// Knowledge Science, Engineering and Management, KSEM 2017, Lecture Notes in Computer Science, vol 10412. Cham: Springer, 2017: 507-519.
14	ALON U, ZILBERSTEIN M, LEVY O, et al. code2vec: Learning distributed representations of code [J]. Proceedings of the ACM on Programming Languages, 2019, 3(POPL): 40.
15	MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space [EB/OL].(2013-09-07)[2023-09-05]. https://doi.org/10.48550/arXiv.1301.3781.
16	COMPTON R, FRANK E, PATROS P, et al. Embedding Java classes with code2vec: Improvements from variable obfuscation [C]// Proceedings of the 2020 IEEE/ACM 17th International Conference on Mining Software Repositories (MSR). IEEE, 2020: 243-253.
17	LI H, WANG T, PAN W F, et al. Mining key classes in Java projects by examining a very small number of classes: A complex network-based approach [J]. IEEE Access, 2021, 9: 28076-28088.
18	TAO P. Research on identifying important classes in software projects based on complex networks [D]. Shanghai: East China Normal University, 2022. (in Chinese)
19	BRIN S, PAGE L. The anatomy of a large-scale hypertextual Web search engine [J]. Computer Networks and ISDN Systems, 1998, 30(1-7): 107-117.
20	DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics (ACL), 2019: 4171-4186.
21	BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners [J]. Advances in Neural Information Processing Systems, 2020, 33(1): 1877-1901.

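Related work	Text description	Source code (bytecode)	Dependencies
Al-Kofahi et al.^[4]	√
Radosavljevic et al.^[5]			√
Yusof et al.^[6]			√
Zheng et al.^[7]		√
Xuan et al.^[8]		√
Gu et al.^[9]		√
Vargas-Baldrich et al.^[10]		√	√
Yang et al.^[11]	√		√
Hamedani et al.^[12]	√		√
This paper	√	√	√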

Software name	Software category	Predicted categories (top 3, in rank order)
Lambspec	Assertion	Assertion, StringUtilities, redisClient
OkapiBarcode	Barcode	Barcode, MachineLearning, Encryption
Javassist	BytecodeLibs	BytecodeLibs, Reflection, ClasspathTools
Jmemcached	CacheImps	CacheImps, Microbenchmark, ParserGens
Zclasspath	ClasspathTools	ClasspathTools, VirtualFileSystem, Assertion
Dasein	cloudComputing	cloudComputing, ORM, httpClients
Jopt	CmdLineParsers	CmdLineParsers, Microbenchmark, ParserGens
Snappy	compressLibs	compressLibs, HashingLibs, Encryption
Jmdns	DNSLibs	SSH Library, cloudComputing, compressLibs
Unirest	httpClients	httpClients, cloudComputing, CmdLineParsers
Jersey	JSONLibs	JSONLibs, httpClients, cloudComputing
Jsontoken	JWTLibs	JWTLibs, cloudComputing, DNSLibs
OpenIMAJ	MachineLearning	MachineLearning, MathLibs, Barcode
JTS	MathLibs	Barcode, ParserGens, MathLibs
KoPeMe	Microbenchmark	Microbenchmark, CacheImps, cloudComputing
MyBatis	ORM	ORM, cloudComputing, httpClients
ToucanPdf	PDFLibs	Barcode, PDFLibs, MachineLearning
Reb4j	RegexLibs	ParserGens, redisClient, StringUtilities
TrueZip	VirtualFileSystem	VirtualFileSystem, CacheImps, httpClients
Wasync	websocketClients	httpClients, cloudComputing, websocketClients
Jsoup	html parser	html parser, ParserGens, httpClients
JParsec	ParserGens	Assertion, ParserGens, ORM
Redisson	redisClient	CacheImps, redisClient, Microbenchmark
WildFly	Security	Security, Assertion, ORM
SSHJ	SSH Library	SSH Library, httpClients, cloudComputing
tomgibara	HashingLibs	JWTLibs, HashingLibs, UUIDGens
Reflection-Util	Reflection	Reflection, BytecodeLibs, ORM
Vt-Crypt	Encryption	Encryption, SSH Library, JWTLibs
UUID-Creator	UUIDGens	UUIDGens, cloudComputing, JWTLibs
Joda-Convert	StringUtilities	Reflection, ORM, JWTLibs
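
The top-1 and top-3 accuracies quoted in the abstract can be recomputed from the table above. The sketch below copies the 30 rows and assumes that the three predicted categories in each row are listed in rank order; the helper name `top_k_accuracy` is introduced here for illustration.

```python
# (software name, true category, ranked predictions), copied from the table above.
ROWS = [
    ("Lambspec", "Assertion", ["Assertion", "StringUtilities", "redisClient"]),
    ("OkapiBarcode", "Barcode", ["Barcode", "MachineLearning", "Encryption"]),
    ("Javassist", "BytecodeLibs", ["BytecodeLibs", "Reflection", "ClasspathTools"]),
    ("Jmemcached", "CacheImps", ["CacheImps", "Microbenchmark", "ParserGens"]),
    ("Zclasspath", "ClasspathTools", ["ClasspathTools", "VirtualFileSystem", "Assertion"]),
    ("Dasein", "cloudComputing", ["cloudComputing", "ORM", "httpClients"]),
    ("Jopt", "CmdLineParsers", ["CmdLineParsers", "Microbenchmark", "ParserGens"]),
    ("Snappy", "compressLibs", ["compressLibs", "HashingLibs", "Encryption"]),
    ("Jmdns", "DNSLibs", ["SSH Library", "cloudComputing", "compressLibs"]),
    ("Unirest", "httpClients", ["httpClients", "cloudComputing", "CmdLineParsers"]),
    ("Jersey", "JSONLibs", ["JSONLibs", "httpClients", "cloudComputing"]),
    ("Jsontoken", "JWTLibs", ["JWTLibs", "cloudComputing", "DNSLibs"]),
    ("OpenIMAJ", "MachineLearning", ["MachineLearning", "MathLibs", "Barcode"]),
    ("JTS", "MathLibs", ["Barcode", "ParserGens", "MathLibs"]),
    ("KoPeMe", "Microbenchmark", ["Microbenchmark", "CacheImps", "cloudComputing"]),
    ("MyBatis", "ORM", ["ORM", "cloudComputing", "httpClients"]),
    ("ToucanPdf", "PDFLibs", ["Barcode", "PDFLibs", "MachineLearning"]),
    ("Reb4j", "RegexLibs", ["ParserGens", "redisClient", "StringUtilities"]),
    ("TrueZip", "VirtualFileSystem", ["VirtualFileSystem", "CacheImps", "httpClients"]),
    ("Wasync", "websocketClients", ["httpClients", "cloudComputing", "websocketClients"]),
    ("Jsoup", "html parser", ["html parser", "ParserGens", "httpClients"]),
    ("JParsec", "ParserGens", ["Assertion", "ParserGens", "ORM"]),
    ("Redisson", "redisClient", ["CacheImps", "redisClient", "Microbenchmark"]),
    ("WildFly", "Security", ["Security", "Assertion", "ORM"]),
    ("SSHJ", "SSH Library", ["SSH Library", "httpClients", "cloudComputing"]),
    ("tomgibara", "HashingLibs", ["JWTLibs", "HashingLibs", "UUIDGens"]),
    ("Reflection-Util", "Reflection", ["Reflection", "BytecodeLibs", "ORM"]),
    ("Vt-Crypt", "Encryption", ["Encryption", "SSH Library", "JWTLibs"]),
    ("UUID-Creator", "UUIDGens", ["UUIDGens", "cloudComputing", "JWTLibs"]),
    ("Joda-Convert", "StringUtilities", ["Reflection", "ORM", "JWTLibs"]),
]

def top_k_accuracy(rows, k):
    """Fraction of rows whose true category is among the first k predictions."""
    hits = sum(1 for _, truth, preds in rows if truth in preds[:k])
    return hits / len(rows)

top1 = top_k_accuracy(ROWS, 1)
top3 = top_k_accuracy(ROWS, 3)
misses = [name for name, truth, preds in ROWS if truth not in preds]

print(f"top-1 = {top1:.0%}, top-3 = {top3:.0%}")  # top-1 = 70%, top-3 = 90%
print("not in top 3:", misses)
```

The first prediction is correct for 21 of the 30 systems and the true category appears among the three candidates for 27, which matches the reported 70% and 90%. The three systems missed entirely are Jmdns, Reb4j, and Joda-Convert.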

Software name	Software category	Predicted category
gpars	ActorFrameworks	None
HdrHistogram	ApplicationMetrics	None
jongo	MongoClient	None
jdom2	xmlProcess	html parser
bobo	SearchEngines	None
generex	RegularExpressionLibraries	None
DeephacksCached	OffHeapLibraries	None
commonmark	Markdown	html parser
log4j	logging	CmdLineParsers
jmxutils	JMXLibraries	None
activeio	IOUtilities	None
bitsy	GraphDatabases	None
ftpserver	FTP	None
fastexcel	ExcelLibraries	None
activej	DependencyInjection	None
CheckerQual	defectDetect	None
dateutils	DateandTimeUtilities	None
jansi	ConsoleUtilities	None
awaitility	concurrent	None
jcommander	CommandLineParsers	None
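
The table above appears to list systems whose category labels differ from the 30 used in the previous table. As a small tally under that reading, the sketch below counts how many of them received no predicted category; Python `None` stands for the table's "no category" entry, and the variable names are illustrative.

```python
# Predicted label per system, copied from the table above.
# Python None stands for the table's "no category" entry.
PREDICTED = {
    "gpars": None, "HdrHistogram": None, "jongo": None,
    "jdom2": "html parser", "bobo": None, "generex": None,
    "DeephacksCached": None, "commonmark": "html parser",
    "log4j": "CmdLineParsers", "jmxutils": None, "activeio": None,
    "bitsy": None, "ftpserver": None, "fastexcel": None,
    "activej": None, "CheckerQual": None, "dateutils": None,
    "jansi": None, "awaitility": None, "jcommander": None,
}

# Systems that were left without a category versus those that were assigned one.
unassigned = sum(1 for label in PREDICTED.values() if label is None)
assigned = {name: label for name, label in PREDICTED.items() if label is not None}

print(f"{unassigned} of {len(PREDICTED)} systems received no category")
for name, label in assigned.items():
    print(f"{name} was placed in {label}")
```

Of the 20 systems, 17 were left without a category, and the remaining three (jdom2, commonmark, log4j) were placed in an existing category.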
