Application and evaluation of large language models in open source project topic annotation

doi:10.3969/j.issn.1000-5641.2025.05.002

Abstract

With the rapid growth of open source communities, the number of GitHub projects continues to surge. However, a portion of these projects lack explicit topic labels, which makes technology selection and project retrieval harder for developers. Existing topic generation methods rely primarily on supervised learning and therefore depend heavily on high-quality annotated data, among other limitations. To address the accuracy and efficiency of topic annotation for open source projects, this study presents the first investigation of how well large language models perform on the GitHub project topic prediction task. We constructed a dataset of 3000 popular GitHub projects, selected using a quantitative metric designed to evaluate the activity and influence of open source projects; the dataset covers multidimensional features including repository names, README documents, and description information. Comparative experiments were conducted on several mainstream large language models from China and abroad: Claude 3.7 Sonnet, DeepSeek-V3, Gemini 2.0 Flash, GPT-4o, and Qwen-Plus. The results show that Claude 3.7 Sonnet performed best on most evaluation metrics, and that the performance of every model tended to stabilize as the dataset grew. The experiments indicate that large language models are well suited to project topic annotation, although performance differs significantly among models. These findings provide a reference for open source project management and the design of intelligent annotation systems.

Key words: large language model, repository mining, topic annotation, open source dataset

CLC number: TP391

Dexin HE, Fanyu HAN, Wei WANG. Application and evaluation of large language models in open source project topic annotation[J]. Journal of East China Normal University (Natural Science), 2025, 2025(5): 14-24.

Figures/Tables: 8 (Figure 1, Table 1, Table 2, Figure 2, Figure 3, Table 3, Table 4, Table 5)

Dataset	Total repositories	Distinct topics	Most common number of topics per repository	Repositories without a description
Popular GitHub repositories	3000	9623	5	25

Model	P@3	R@3	F₁@3	S@3	P@4	R@4	F₁@4	S@4	P@5	R@5	F₁@5	S@5
GPT	0.44	0.21	0.29	0.78	0.39	0.24	0.30	0.81	0.35	0.26	0.30	0.83
DeepSeek	0.51	0.24	0.32	0.84	0.45	0.27	0.34	0.87	0.41	0.30	0.35	0.88
Qwen	0.47	0.22	0.30	0.81	0.42	0.25	0.32	0.84	0.38	0.28	0.32	0.86
Gemini	0.39	0.19	0.25	0.68	0.35	0.22	0.27	0.71	0.32	0.24	0.27	0.72
Claude	0.51	0.24	0.33	0.85	0.45	0.28	0.35	0.88	0.41	0.31	0.36	0.89
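
The column names in the table above follow the usual top-k convention for topic recommendation. The source page does not define them, so the sketch below assumes the common definitions: P@k is the fraction of the k predicted topics that are correct, R@k is the fraction of ground-truth topics recovered, F₁@k is their harmonic mean per repository, and S@k is 1 when at least one predicted topic is correct. The function names `metrics_at_k` and `mean_metrics` are hypothetical, and averaging per repository is also an assumption.

```python
from typing import Dict, List, Sequence, Tuple


def metrics_at_k(predicted: Sequence[str], actual: Sequence[str], k: int) -> Dict[str, float]:
    """Score the top-k predicted topics of one repository (assumed definitions)."""
    top_k = set(predicted[:k])          # keep only the first k predictions
    truth = set(actual)                 # ground-truth topics of the repository
    hits = len(top_k & truth)           # correctly predicted topics
    precision = hits / k
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    success = 1.0 if hits else 0.0      # at least one correct topic in the top k
    return {"P": precision, "R": recall, "F1": f1, "S": success}


def mean_metrics(samples: List[Tuple[Sequence[str], Sequence[str]]], k: int) -> Dict[str, float]:
    """Average the per-repository scores over a list of (predicted, actual) pairs."""
    per_repo = [metrics_at_k(pred, act, k) for pred, act in samples]
    return {name: sum(m[name] for m in per_repo) / len(per_repo)
            for name in ("P", "R", "F1", "S")}


if __name__ == "__main__":
    # Illustrative data only, not taken from the paper's dataset.
    samples = [
        (["python", "cli", "docs", "web", "api"], ["python", "web", "http"]),
        (["a", "b", "c"], ["x"]),
    ]
    for k in (3, 4, 5):
        print(k, mean_metrics(samples, k))
```

Under this per-repository averaging, the mean F₁@k is not required to equal the harmonic mean of the mean P@k and mean R@k.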

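Total repositories	Repositories without a description	Mean description length (characters)	Repositories without a README	Mean README length (characters)	Mean number of ground-truth topics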
188	12	73	0	7317	2.7

Topic	Occurrences	Share of repositories with this topic (%)
hacktoberfest	54	28.7
jenkins-cft	6	3.2
docs	3	1.6
open source	3	1.6
jenkins-cft-a-c	3	1.6
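
The percentages in the table above appear to be the occurrence counts divided by the 188 repositories reported in the preceding table. That denominator is an inference from the numbers rather than something the page states; the short check below reproduces the column under that assumption.

```python
# Assumed denominator: the 188 repositories from the preceding table.
TOTAL_REPOS = 188

# Occurrence counts copied from the topic table.
occurrences = {
    "hacktoberfest": 54,
    "jenkins-cft": 6,
    "docs": 3,
    "open source": 3,
    "jenkins-cft-a-c": 3,
}

# Share of repositories carrying each topic, in percent, rounded to one decimal.
shares = {topic: round(count / TOTAL_REPOS * 100, 1) for topic, count in occurrences.items()}

for topic, share in shares.items():
    print(f"{topic}\t{share}")
```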

Model	Input price (USD)	Output price (USD)
Claude	3.00	15.00
DeepSeek	0.27	1.10
Qwen	0.11	0.28
GPT	5.00	15.00
Gemini	0.10	0.40
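
To compare annotation cost across models, the prices above can be combined with token counts. The table does not state the billing unit, so the sketch below assumes prices are per one million tokens, and the token counts in the usage example are hypothetical rather than measured. `estimate_cost` is an illustrative helper, not part of the study.

```python
from typing import Dict, Tuple

# (input price, output price) in USD, copied from the table above.
# Assumption: each price applies to one million tokens.
PRICES: Dict[str, Tuple[float, float]] = {
    "Claude": (3.00, 15.00),
    "DeepSeek": (0.27, 1.10),
    "Qwen": (0.11, 0.28),
    "GPT": (5.00, 15.00),
    "Gemini": (0.10, 0.40),
}

TOKENS_PER_PRICE_UNIT = 1_000_000  # assumed billing unit


def estimate_cost(model: str, input_tokens: int, output_tokens: int, repos: int = 1) -> float:
    """Estimate the USD cost of annotating `repos` repositories with one model."""
    input_price, output_price = PRICES[model]
    per_repo = input_tokens * input_price + output_tokens * output_price
    return repos * per_repo / TOKENS_PER_PRICE_UNIT


if __name__ == "__main__":
    # Hypothetical workload: 2000 input and 50 output tokens per repository.
    for name in PRICES:
        print(name, round(estimate_cost(name, 2000, 50, repos=3000), 2))
```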