Journal of East China Normal University (Natural Science) ›› 2025, Vol. 2025 ›› Issue (5): 14-24. doi: 10.3969/j.issn.1000-5641.2025.05.002

• AI-Empowered Open Source Technology and Applications •

Application and evaluation of large language models in open source project topic annotation

Dexin HE, Fanyu HAN, Wei WANG*

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  • Received: 2025-06-27 Online: 2025-09-25 Published: 2025-09-25
  • Contact: Wei WANG E-mail: wwang@dase.ecnu.edu.cn

Abstract:

With the rapid growth of open source communities, the number of GitHub projects continues to surge. However, a considerable portion of these projects lack explicit topic labels, which makes technology selection and project retrieval harder for developers. Existing topic generation methods rely primarily on supervised learning and therefore depend heavily on high-quality annotated data, among other limitations. To address the accuracy and efficiency of topic annotation for open source projects, this work presents the first comprehensive study of how well large language models perform on GitHub project topic prediction. We constructed a dataset of 3000 popular GitHub projects, selected using a quantitative metric designed to evaluate the activity and influence of open source projects; the dataset covers multidimensional features including repository names, README documents, and description information. Comparative experiments were conducted on several mainstream large language models from China and abroad: Claude 3.7 Sonnet, DeepSeek-V3, Gemini 2.0 Flash, GPT-4o, and Qwen-Plus. The results show that Claude 3.7 Sonnet performed best on most evaluation metrics and that the performance of every model tended to stabilize as the dataset grew. The experiments indicate that large language models have good applicability to project topic annotation, although performance differs significantly among models. These findings provide a reference for open source project management and the design of intelligent annotation systems.
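The abstract does not name the evaluation metrics used, so the following is only an illustrative sketch of one common way to score predicted topics against a repository's existing topic labels: set-based precision, recall, and F1, averaged over repositories. The function names (`normalize`, `topic_scores`, `macro_average`) and the normalization rule are assumptions made for this example and are not taken from the paper.

```python
def normalize(topic: str) -> str:
    """Lowercase and hyphenate a topic so 'Machine Learning' matches 'machine-learning'.

    Assumption: this mimics the lowercase, hyphen-separated style of GitHub topics.
    """
    return "-".join(topic.strip().lower().split())


def topic_scores(predicted, actual):
    """Return (precision, recall, f1) for one repository's predicted vs. actual topics."""
    pred = {normalize(t) for t in predicted}
    gold = {normalize(t) for t in actual}
    hits = len(pred & gold)  # topics that appear in both sets
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def macro_average(pairs):
    """Average the per-repository scores over (predicted, actual) pairs."""
    scores = [topic_scores(p, a) for p, a in pairs]
    n = len(scores)
    if n == 0:
        return 0.0, 0.0, 0.0
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


if __name__ == "__main__":
    # Hypothetical model output and hypothetical ground-truth topics.
    predicted = ["Machine Learning", "python", "cli"]
    actual = ["machine-learning", "python"]
    print(topic_scores(predicted, actual))
```

Scoring each repository separately and then averaging gives every project equal weight regardless of how many topics it carries; pooling all topics before scoring (micro-averaging) would be an equally reasonable choice.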

Key words: large language model, repository mining, topic annotation, open source dataset

CLC number: