Journal of East China Normal University (Natural Science) ›› 2025, Vol. 2025 ›› Issue (1): 46-58. doi: 10.3969/j.issn.1000-5641.2025.01.004

• Computer Science •

Research on software classification based on the fusion of code and descriptive text

Yuhang CHEN, Shizhou WANG, Zhengting TANG, Liangyu CHEN, Ningkang JIANG*

  1. Software Engineering Institute, East China Normal University, Shanghai 200062, China
  • Received: 2023-10-31 Online: 2025-01-25 Published: 2025-01-20
  • Contact: Ningkang JIANG E-mail: nkjiang@sei.ecnu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (62272416)

Abstract:

Third-party software systems play a significant role in modern software development. Developers search third-party software repositories for dependency libraries that meet their requirements and build on them, which avoids reinventing the wheel and speeds up development. However, retrieving a suitable third-party library can be difficult. Repositories typically provide preset tags (categories) for developers to search by, but when a software system's preset tag is wrong, developers cannot find the library they need, and the development process suffers. This study proposes a software classification model to address this problem. The model combines method vectors, method importance, and text vectors to assign software of unknown category to one of the known categories. Because no public dataset existed for this problem, we built one and made it publicly available; it contains 120 software systems in 30 categories from the Maven repository. On this self-built dataset, the prediction accuracy of the model was 70% with one candidate (top-1) and 90% with three candidates (top-3). The experimental results show that the proposed model can effectively classify software systems in open-source repositories and help developers quickly locate third-party libraries.

Key words: software classification, third-party software system, method importance score, code2vec
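
The abstract names three ingredients (method vectors, method importance scores, and a text vector) and reports top-1 and top-3 accuracy, but it does not say how the ingredients are fused or how candidates are ranked. The following is only a minimal sketch under stated assumptions: method vectors are averaged with importance scores as weights, the result is concatenated with the description vector, and categories are ranked by cosine similarity to one prototype vector per category. All function names, the prototype idea, and the toy numbers are hypothetical and are not taken from the paper.

```python
import math


def weighted_mean(vectors, weights):
    """Importance-weighted average of equal-length vectors (assumed fusion step)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(method_vectors, importance, text_vector):
    """Concatenate the weighted code vector with the text vector (assumption)."""
    code_vec = l2_normalize(weighted_mean(method_vectors, importance))
    return code_vec + l2_normalize(text_vector)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def top_k_categories(query, prototypes, k):
    """Return the k category names whose prototype is most similar to query."""
    ranked = sorted(prototypes,
                    key=lambda name: cosine(query, prototypes[name]),
                    reverse=True)
    return ranked[:k]


def top_k_accuracy(ranked_predictions, labels, k):
    """Fraction of samples whose true label is among the first k candidates."""
    hits = sum(1 for ranked, label in zip(ranked_predictions, labels)
               if label in ranked[:k])
    return hits / len(labels)


# Toy data: two 2-d method vectors, their importance scores, one 2-d text vector.
query = fuse([[1.0, 0.0], [0.0, 1.0]], [0.9, 0.1], [1.0, 0.0])

# Hypothetical 4-d category prototypes (2 code dimensions + 2 text dimensions).
prototypes = {
    "logging": [1.0, 0.0, 1.0, 0.0],
    "json": [0.0, 1.0, 0.0, 1.0],
    "testing": [1.0, 0.0, 0.0, 1.0],
}

candidates = top_k_categories(query, prototypes, k=2)
```

Top-k accuracy counts a prediction as correct when the true category appears anywhere among the first k ranked candidates, which is why a top-3 figure can never be lower than the corresponding top-1 figure on the same test set.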

CLC Number: