Application and evaluation of large language models in open source project topic annotation

doi:10.3969/j.issn.1000-5641.2025.05.002

Abstract

With the rapid development of open source communities, the number of GitHub projects has increased exponentially. However, a considerable portion of these projects lack explicit topic labels, which hinders developers in technology selection and project retrieval. Existing topic generation methods rely primarily on supervised learning, which depends heavily on high-quality annotated data, among other limitations. To address the accuracy and efficiency of project topic annotation in open source communities, this study presents the first comprehensive evaluation of large language models on the GitHub project topic prediction task. We constructed a dataset of 3000 popular GitHub projects, selected using a quantitative metric designed to evaluate the activity and influence of open source projects. Each project is described by multidimensional features, including its repository name, README document, and description. Comparative experiments were conducted on several mainstream domestic and international large language models: Claude 3.7 Sonnet, DeepSeek-V3, Gemini 2.0 Flash, GPT-4o, and Qwen-Plus. Claude 3.7 Sonnet achieved the best performance on most evaluation metrics, and the performance of all models tended to stabilize as the dataset scale expanded. The experiments showed that large language models are well suited to project topic annotation, although performance differs significantly among models. These findings provide a reference for open source community project management and the design of intelligent annotation systems.

Key words: large language model, repository mining, topic annotation, open source dataset

CLC Number: TP391

Dexin HE, Fanyu HAN, Wei WANG. Application and evaluation of large language models in open source project topic annotation[J]. J* E* C* N* U* N* S*, 2025, 2025(5): 14-24.

Figures/Tables (8): Fig. 1, Table 1, Table 2, Fig. 2, Fig. 3, Table 3, Table 4, Table 5

Dataset	Total repositories	Total distinct topics	Most common number of topics per repository	Repositories missing a description
Popular GitHub repositories	3000	9623	5	25

Model	P@3	R@3	F₁@3	S@3	P@4	R@4	F₁@4	S@4	P@5	R@5	F₁@5	S@5
GPT	0.44	0.21	0.29	0.78	0.39	0.24	0.30	0.81	0.35	0.26	0.30	0.83
DeepSeek	0.51	0.24	0.32	0.84	0.45	0.27	0.34	0.87	0.41	0.30	0.35	0.88
Qwen	0.47	0.22	0.30	0.81	0.42	0.25	0.32	0.84	0.38	0.28	0.32	0.86
Gemini	0.39	0.19	0.25	0.68	0.35	0.22	0.27	0.71	0.32	0.24	0.27	0.72
Claude	0.51	0.24	0.33	0.85	0.45	0.28	0.35	0.88	0.41	0.31	0.36	0.89
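
The P@k, R@k, F₁@k, and S@k columns above are ranking metrics computed over the top k predicted topics. The page does not define them, so the following is only a minimal sketch under common conventions: precision divides the hits by k, recall divides by the number of ground-truth topics, and S@k is assumed to be a success rate (1 if at least one of the top k predictions is correct). The function names `metrics_at_k` and `macro_average`, and the per-repository averaging, are illustrative assumptions rather than the authors' implementation.

```python
from typing import Dict, Sequence, Set


def metrics_at_k(predicted: Sequence[str], actual: Set[str], k: int) -> Dict[str, float]:
    """Score one repository's ranked topic predictions against its true topics."""
    top_k = list(predicted)[:k]
    hits = sum(1 for topic in top_k if topic in actual)
    # Assumption: precision is normalised by k even if fewer than k topics were predicted.
    precision = hits / k if k else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    # Assumption: S@k is 1 when at least one top-k prediction is a true topic.
    success = 1.0 if hits else 0.0
    return {"P": precision, "R": recall, "F1": f1, "S": success}


def macro_average(rows: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over repositories (assumed aggregation scheme)."""
    if not rows:
        raise ValueError("at least one scored repository is required")
    return {key: sum(row[key] for row in rows) / len(rows) for key in ("P", "R", "F1", "S")}


# Hypothetical repository: two of the three predicted topics are correct.
example = metrics_at_k(
    ["python", "machine-learning", "cli"],
    {"python", "cli", "docker", "api"},
    k=3,
)
print(example)
```

Under these assumptions, F₁ averaged per repository need not equal the F₁ computed from the averaged P and R, which may explain small differences between columns in the table.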

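Total repositories	Repositories missing a description	Mean description length (characters)	Repositories missing a README	Mean README length (characters)	Mean number of ground-truth topics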
188	12	73	0	7317	2.7

Topic text	Occurrences	Share of repositories containing the topic (%)
hacktoberfest	54	28.7
jenkins-cft	6	3.2
docs	3	1.6
open source	3	1.6
jenkins-cft-a-c	3	1.6
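
The share column above is consistent with dividing each topic's occurrence count by the 188 repositories reported in the preceding table (for example, 54 / 188 ≈ 28.7%). A minimal sketch of that tally follows. The helper name `topic_shares` and the toy input are hypothetical, and counting each topic at most once per repository is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def topic_shares(repo_topics: Iterable[Iterable[str]]) -> List[Tuple[str, int, float]]:
    """Return (topic, occurrences, share of repositories in %) sorted by frequency."""
    # Assumption: a topic counts once per repository, so duplicates are collapsed.
    repos = [set(topics) for topics in repo_topics]
    if not repos:
        return []
    counts = Counter(topic for topics in repos for topic in topics)
    total = len(repos)
    return [(topic, n, round(100 * n / total, 1)) for topic, n in counts.most_common()]


# Toy input with the same proportions as the first table row: 54 of 188 repositories.
toy = [["hacktoberfest"]] * 54 + [[]] * 134
print(topic_shares(toy))
```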

Large language model	Input price (USD)	Output price (USD)
Claude	3.00	15.00
DeepSeek	0.27	1.10
Qwen	0.11	0.28
GPT	5.00	15.00
Gemini	0.10	0.40
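
The price table can be turned into a rough cost estimate for annotating a dataset. The sketch below assumes the prices are quoted per one million tokens, which the table does not state. The token counts per repository are made-up illustration values, and `estimate_cost` is a hypothetical helper.

```python
from typing import Dict, Tuple

# (input price, output price) in USD, copied from the table above.
PRICES_USD: Dict[str, Tuple[float, float]] = {
    "Claude": (3.00, 15.00),
    "DeepSeek": (0.27, 1.10),
    "Qwen": (0.11, 0.28),
    "GPT": (5.00, 15.00),
    "Gemini": (0.10, 0.40),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  tokens_per_price_unit: int = 1_000_000) -> float:
    """Estimate the total cost in USD for a given token volume.

    Assumption: prices apply per `tokens_per_price_unit` tokens (one million by default).
    """
    input_price, output_price = PRICES_USD[model]
    return (input_tokens * input_price + output_tokens * output_price) / tokens_per_price_unit


# Illustrative workload: 3000 repositories, 2000 prompt tokens and 50 completion tokens each.
repositories = 3000
for name in PRICES_USD:
    cost = estimate_cost(name, repositories * 2000, repositories * 50)
    print(f"{name}: {cost:.2f} USD")
```

Because input tokens dominate a topic-annotation workload (long README, short topic list), the input price is likely the main driver of total cost under these assumptions.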