Journal of East China Normal University (Natural Science), 2025, Vol. 2025, Issue (5): 1-13. doi: 10.3969/j.issn.1000-5641.2025.05.001

• AI-Empowered Open Source Technologies and Applications •

  • Supported by:
    National Natural Science Foundation of China (62137001, 62277017, 61977026)

Research on the GitHub developer geographic location prediction method based on multi-dimensional feature fusion

Sijia ZHAO, Fanyu HAN, Wei WANG*

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  • Received: 2025-01-15 Online: 2025-09-25 Published: 2025-09-25
  • Contact: Wei WANG, E-mail: wwang@dase.ecnu.edu.cn

Abstract:

Developers' geographic locations are important for understanding the global distribution of open source activity and for formulating regional policies. However, many developer accounts on GitHub lack location information, which limits comprehensive analysis of the geographic distribution of the global open source ecosystem. This study proposes a hierarchical geographic location prediction framework based on multi-dimensional feature fusion. It integrates three categories of features (temporal behavior, linguistic and cultural cues, and network characteristics) into a four-tier progressive prediction mechanism: rule-driven rapid positioning, name-based cultural inference, time zone cross-validation, and a deep learning ensemble. Experiments on a large-scale dataset built from 50,000 globally active developers show that the method predicted the geographic location of 82.52% of the developers. Among the tiers, the name-based cultural inference layer covered the most users, with an accuracy of 0.7629, whereas the deep learning ensemble layer handled the most complex cases, with an accuracy of 0.7557. A comparison with predictions from the Moonshot large language model confirmed the advantage of the proposed method in complex geographic inference tasks.
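
The four-tier progressive mechanism described in the abstract can be pictured as a cascade in which each tier either returns a prediction or defers to the next. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `Developer` fields, the lookup tables, and every layer body are hypothetical placeholders, the time zone tier is simplified to a direct lookup rather than a cross-validation step, and the deep learning ensemble is represented by a stub that always defers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Developer:
    # Hypothetical record; the abstract does not specify the real feature schema.
    login: str
    name: str = ""
    email: str = ""
    utc_offset_hours: Optional[float] = None  # assumed to be derived from activity timestamps


# A layer returns a country code, or None to defer to the next tier.
Layer = Callable[[Developer], Optional[str]]


def rule_layer(dev: Developer) -> Optional[str]:
    """Tier 1 (toy rule): map a country-code e-mail TLD to a country."""
    tld_to_country = {"cn": "CN", "de": "DE", "jp": "JP"}  # illustrative table
    tld = dev.email.rsplit(".", 1)[-1].lower() if "." in dev.email else ""
    return tld_to_country.get(tld)


def name_layer(dev: Developer) -> Optional[str]:
    """Tier 2 (toy rule): look up the last name token in a small table."""
    surname_to_country = {"tanaka": "JP", "müller": "DE"}  # illustrative table
    parts = dev.name.lower().split()
    return surname_to_country.get(parts[-1]) if parts else None


def timezone_layer(dev: Developer) -> Optional[str]:
    """Tier 3 (toy rule): accept only offsets that identify one country here."""
    offset_to_country = {5.75: "NP"}  # UTC+05:45 is used only by Nepal
    if dev.utc_offset_hours is None:
        return None
    return offset_to_country.get(dev.utc_offset_hours)


def model_layer(dev: Developer) -> Optional[str]:
    """Tier 4 placeholder for a learned ensemble; this stub always defers."""
    return None


LAYERS: List[Tuple[str, Layer]] = [
    ("rule", rule_layer),
    ("name", name_layer),
    ("timezone", timezone_layer),
    ("model", model_layer),
]


def predict(dev: Developer, layers: List[Tuple[str, Layer]] = LAYERS):
    """Return (tier label, country) from the first tier that answers, else None."""
    for label, layer in layers:
        country = layer(dev)
        if country is not None:
            return label, country
    return None
```

Returning the tier label together with the country makes it possible to report coverage and accuracy per tier, as the abstract does. Developers for whom every tier defers stay unpredicted, which matches the fact that the reported coverage is below 100%.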

Key words: GitHub, multi-dimensional feature, deep learning, geographic location prediction
