华东师范大学学报(自然科学版) ›› 2025, Vol. 2025 ›› Issue (5): 140-150. doi: 10.3969/j.issn.1000-5641.2025.05.013

• 开源生态发展与治理 •

基于DTA的GitHub高星仓库活跃度评估方法

游明东, 彭佳恒, 韩凡宇, 王伟*

  1. 华东师范大学 数据科学与工程学院, 上海 200062
  • 收稿日期:2025-07-03 接受日期:2025-07-09 出版日期:2025-09-25 发布日期:2025-09-25
  • 通讯作者: 王伟 E-mail:wwang@dase.ecnu.edu.cn

A DTA-based activity evaluation method for high-star GitHub repositories

Mingdong YOU, Jiaheng PENG, Fanyu HAN, Wei WANG*

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  • Received:2025-07-03 Accepted:2025-07-09 Online:2025-09-25 Published:2025-09-25
  • Contact: Wei WANG E-mail:wwang@dase.ecnu.edu.cn

摘要:

以识别GitHub长期活跃高星仓库帮助开源社区构建和数字基础设施建设为背景, 提出了一种基于时间序列预测模型的GitHub高星仓库长期活跃度评估方法, 旨在解决识别仓库是否能够保持长期活跃的问题. 该方法首次引入开发者活跃周期作为关键特征, 用以提升仓库发展趋势预测的准确性. 通过对活动指标的时间序列数据进行建模与挖掘, 该方法提出了全新的活跃度计算公式DTA (Development Trend-based Activity), 实现了对仓库活跃水平的准确量化评估. 设计并制作了一个时间粒度细、覆盖范围广的基准数据集, 并系统评估了多种预测模型的表现, 最终确定了适用于开源仓库活跃度预测的最优模型. 实验结果验证了所提方法的有效性, 能够准确预测仓库的长期活跃情况. 因此, 引入DTA对仓库活跃度进行评估, 能够帮助开源参与者识别长期活跃的仓库, 确定参与重心, 促进开源社区的构建和数字基础设施建设.

关键词: 开源软件仓库, 活跃度评估, 开源社区

Abstract:

To help identify GitHub high-star repositories that remain active over the long term, and thereby support open-source community building and digital infrastructure development, we propose a method for evaluating the long-term activity of such repositories. The method is based on a time series prediction model and addresses the problem of determining whether a repository can stay active over the long term. It is the first to introduce the developer activity cycle as a key feature, which improves the accuracy of predictions of repository development trends. By modeling and mining time series data of activity indicators, we derive a new activity calculation formula, development trend-based activity (DTA), which provides an accurate quantitative evaluation of a repository's activity level. We also designed and built a benchmark dataset with fine time granularity and broad coverage, systematically evaluated multiple prediction models on it, and identified the best-performing model for forecasting open-source repository activity. The experimental results verify the effectiveness of the proposed method and show that it can accurately predict the long-term activity of repositories. Using DTA to evaluate repository activity can therefore help open-source participants identify repositories that are active over the long term and decide where to focus their participation, promoting open-source community building and digital infrastructure development.

Key words: open-source repository, activity assessment, open-source community

中图分类号: