J* E* C* N* U* N* S* ›› 2025, Vol. 2025 ›› Issue (5): 14-24. doi: 10.3969/j.issn.1000-5641.2025.05.002

• AI-Enabled Open Source Technologies and Applications •

Application and evaluation of large language models in open source project topic annotation

Dexin HE, Fanyu HAN, Wei WANG*

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  • Received: 2025-06-27 Online: 2025-09-25 Published: 2025-09-25
  • Contact: Wei WANG E-mail: wwang@dase.ecnu.edu.cn

Abstract:

With the rapid growth of open source communities, the number of GitHub projects has increased exponentially. However, a considerable portion of these projects lack explicit topic labels, which hinders developers during technology selection and project retrieval. Existing topic generation methods rely primarily on supervised learning and therefore depend heavily on high-quality annotated data, among other limitations. To address the accuracy and efficiency of topic annotation for open source projects, this study presents the first comprehensive evaluation of large language models on the GitHub project topic prediction task. We constructed a dataset of 3000 popular GitHub projects, selected using a quantitative metric designed to measure the activity and influence of open source projects; each project is described by multidimensional features, including the repository name, README document, and description. Comparative experiments were conducted on several mainstream large language models from China and abroad: Claude 3.7 Sonnet, DeepSeek-V3, Gemini 2.0 Flash, GPT-4o, and Qwen-Plus. Claude 3.7 Sonnet achieved the best performance on most evaluation metrics, and the performance of all models tended to stabilize as the dataset size increased. The experiments show that large language models are well suited to project topic annotation, although performance differs significantly among models. These findings provide a reference for open source project management and the design of intelligent annotation systems.

Key words: large language model, repository mining, topic annotation, open source dataset
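
The pipeline summarized in the abstract (repository name, description, and README in; predicted topics out; predictions compared against existing labels) might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the `Repo` type, the helper names, and the choice of set-based precision, recall, and F1 are all hypothetical, since the abstract does not specify the prompts or the evaluation metrics used. The call to a language model is deliberately left out; `parse_topics` only handles the reply text.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    """The three repository features named in the abstract."""
    name: str
    description: str
    readme: str


def build_prompt(repo: Repo, max_readme_chars: int = 2000) -> str:
    """Combine the repository features into one prompt (wording is hypothetical)."""
    readme = repo.readme[:max_readme_chars]  # truncate long READMEs
    return (
        "Suggest GitHub topics for this repository as a comma-separated list.\n"
        f"Name: {repo.name}\n"
        f"Description: {repo.description}\n"
        f"README:\n{readme}\n"
    )


def parse_topics(reply: str) -> set[str]:
    """Normalize a comma-separated reply into lowercase, hyphenated topics."""
    topics = set()
    for part in reply.split(","):
        topic = "-".join(part.strip().lower().split())
        if topic:  # skip empty fragments such as ", ,"
            topics.add(topic)
    return topics


def score(predicted: set[str], reference: set[str]) -> dict[str, float]:
    """Set-based precision, recall, and F1 for one repository (assumed metrics)."""
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

In such a setup, the per-repository scores would typically be averaged over the dataset to compare models, which is one way the stabilization with growing dataset size could be observed.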
