Research on the GitHub Developer Geographic Location Prediction Method Based on Multi-Dimensional Feature Fusion

doi:10.3969/j.issn.1000-5641.2025.05.001

Abstract:

Developer geographic location information is important for understanding the global distribution of open source activity and for formulating regional policies. However, a large number of developer accounts on GitHub lack location information, which limits comprehensive analysis of the geographic distribution of the global open source ecosystem. This study proposes a hierarchical geographic location prediction framework based on multi-dimensional feature fusion. By integrating three major categories of features (temporal behavior, linguistic culture, and network characteristics), the framework establishes a four-tier progressive prediction mechanism consisting of rule-driven rapid positioning, name-based cultural inference, time zone cross-validation, and a deep learning ensemble. Experiments on a large-scale dataset built from 50,000 globally active developers show that the method successfully predicted the geographic location of 82.52% of the developers. Among the tiers, the name-based cultural inference layer covered the most users, with an accuracy of 0.7629, whereas the deep learning ensemble layer handled the most complex cases, with an accuracy of 0.7557. A comparison with predictions from the Moonshot large language model supports the advantage of the proposed method in complex geographic inference tasks.

Keywords: GitHub, multi-dimensional features, deep learning, geographic location prediction
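
The tiered mechanism described in the abstract can be pictured as a cascade in which each layer either answers or defers to the next one. The sketch below is only an illustration of that control flow, not the authors' implementation: the layer functions, the record fields (`email`, `name_country_guess`, `utc_offset`, `model_country`), and the lookup tables are all hypothetical, and the name and time zone tiers are collapsed into one toy layer for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Prediction:
    country: str  # country code returned by a layer
    layer: str    # name of the tier that produced the answer


# A layer maps a developer record to a country code, or None to defer.
Layer = Callable[[dict], Optional[str]]


def tld_rule(dev: dict) -> Optional[str]:
    """Hypothetical rule layer: country-code TLD of the public email."""
    cc_tlds = {"cn": "CN", "de": "DE", "jp": "JP", "br": "BR"}  # toy subset
    email = dev.get("email") or ""
    tld = email.rsplit(".", 1)[-1].lower() if "." in email else ""
    return cc_tlds.get(tld)


def name_with_timezone(dev: dict) -> Optional[str]:
    """Hypothetical layer: keep a name-based guess only if the observed
    UTC offset is plausible for that country."""
    plausible_offsets = {"CN": {8}, "DE": {1, 2}, "JP": {9}, "BR": {-3}}
    guess = dev.get("name_country_guess")
    if guess and dev.get("utc_offset") in plausible_offsets.get(guess, set()):
        return guess
    return None


def ensemble_stub(dev: dict) -> Optional[str]:
    """Placeholder for a trained ensemble; here it just reads a field."""
    return dev.get("model_country")


LAYERS: Sequence[Tuple[str, Layer]] = (
    ("rule", tld_rule),
    ("name+timezone", name_with_timezone),
    ("ensemble", ensemble_stub),
)


def predict(dev: dict, layers: Sequence[Tuple[str, Layer]] = LAYERS) -> Optional[Prediction]:
    """Return the first non-None answer, cheapest layer first."""
    for layer_name, layer in layers:
        country = layer(dev)
        if country is not None:
            return Prediction(country, layer_name)
    return None  # no layer could place this developer
```

Ordering the layers from cheapest to most expensive means the costly model only sees the records that the simpler layers could not resolve.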

CLC number: TP39

ZHAO Sijia, HAN Fanyu, WANG Wei. Research on the GitHub developer geographic location prediction method based on multi-dimensional feature fusion[J]. Journal of East China Normal University (Natural Science), 2025, 2025(5): 1-13.

References (36)

1	DABBISH L, STUART C, TSAY J, et al. Social coding in GitHub: Transparency and collaboration in an open software repository [C]// Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work. 2012: 1277-1286.
2	WACHS J, NITECKI M, SCHUELLER W, et al. The geography of open source software: Evidence from GitHub [J]. Technological Forecasting and Social Change, 2022, 176: 121478.
3	NAGLE F. Open source software and firm productivity [J]. Management Science, 2019, 65(3): 1191-1215.
4	WRIGHT N L, NAGLE F, GREENSTEIN S. Open source software and global entrepreneurship [J]. Research Policy, 2023, 52(9): 104846.
5	RUSK D, COADY Y. Location-based analysis of developers and technologies on GitHub [C]// 2014 28th International Conference on Advanced Information Networking and Applications Workshops. IEEE, 2014: 681-685.
6	ALBUSAYS K, BJORN P, DABBISH L, et al. The diversity crisis in software development [J]. IEEE Software, 2021, 38(2): 19-25.
7	MAY A, WACHS J, HANNÁK A. Gender differences in participation and reward on Stack Overflow [J]. Empirical Software Engineering, 2019, 24: 1997-2019.
8	PRANA G A A, FORD D, RASTOGI A, et al. Including everyone, everywhere: Understanding opportunities and challenges of geographic gender-inclusion in OSS [J]. IEEE Transactions on Software Engineering, 2021, 48(9): 3394-3409.
9	DAHLANDER L, GANN D M, WALLIN M W. How open is innovation? A retrospective and ideas forward [J]. Research Policy, 2021, 50(4): 104218.
10	TAKHTEYEV Y. Coding Places: Software Practice in a South American City [M]. Cambridge MA, USA: MIT Press, 2012.
11	SHAIKH M, VAAST E. Folding and unfolding: Balancing openness and transparency in open source communities [J]. Information Systems Research, 2016, 27(4): 813-833.
12	ZHANG S, ZHENG D Q, HU X C, et al. Bidirectional long short-term memory networks for relation classification [C]// Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation. 2015: 73-78.
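13	YUAN L, WANG H M, YIN G, et al. Mining and analyzing behavioral characteristics of developers in open source environments [J]. Chinese Journal of Computers, 2010, 33(10): 1910-1918.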
14	FACKLER T, LAURENTSYEVA N. Gravity in online collaborations: evidence from GitHub [C]// CESifo Forum, 2020, 21(3): 15-20.
15	LIMA A, ROSSI L, MUSOLESI M. Coding together at scale: GitHub as a collaborative social network [C]// Proceedings of the International AAAI Conference on Web and Social Media. 2014, 8(1): 295-304.
16	NADRI R, RODRÍGUEZ-PÉREZ G, NAGAPPAN M. On the relationship between the developer’s perceptible race and ethnicity and the evaluation of contributions in OSS [J]. IEEE Transactions on Software Engineering, 2021, 48(8): 2955-2968.
17	RASTOGI A, NAGAPPAN N, GOUSIOS G, et al. Relationship between geographical location and evaluation of developer contributions in GitHub [C]// Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. 2018: 22.
18	RASTOGI A. Do biases related to geographical location influence work-related decisions in GitHub? [C]// Proceedings of the 38th International Conference on Software Engineering Companion. 2016: 665-667.
19	QUERCIA D, CAPRA L, CROWCROFT J. The social world of Twitter: Topics, geography, and emotions [C]// Proceedings of the International AAAI Conference on Web and Social Media. 2012, 6(1): 298-305.
20	JAIDKA K, GIORGI S, SCHWARTZ H A, et al. Estimating geographic subjective well-being from Twitter: A comparison of dictionary and data-driven language methods [J]. Proceedings of the National Academy of Sciences, 2020, 117(19): 10165-10171.
21	KOLLANYI B. Where do bots come from? An analysis of bot codes shared on GitHub [J]. International Journal of Communication, 2016, 10: 4932-4951.
22	KOMOSNY D, MEHIC M. The value of geographic locations submitted by Internet users [J]. IEEE Access, 2018, 6: 62699-62706.
23	HAN B, COOK P, BALDWIN T. Text-based Twitter user geolocation prediction [J]. Journal of Artificial Intelligence Research, 2014, 49: 451-500.
24	ROSSI D, ZACCHIROLI S. Geographic diversity in public code contributions: an exploratory large-scale study over 50 years [C]// Proceedings of the 19th International Conference on Mining Software Repositories. 2022: 80-85.
25	LE TOURNEAU T, LATENDRESSE J, ABDELLATIF A, et al. Code mapper: Mapping the Global contributions of OSS [C]// Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings. 2024: 44-48.
26	XIA X Y, ZHAO S Y, HAN F Y, et al. OpenDigger: Data mining and information service system for open collaboration digital ecosystem [EB/OL]. (2023-11-26)[2025-01-10]. https://doi.org/10.48550/arXiv.2311.15204.
27	Forebears [EB/OL]. [2025-01-10]. https://forebears.io/about/name-distribution-and-demographics.
28	CLAES M, MÄNTYLÄ M V, KUUTILA M, et al. Do programmers work at night or during the weekend? [C]// Proceedings of the 40th International Conference on Software Engineering. 2018: 705-715.
29	ZHAO S Y, XIA X Y, FITZGERALD B, et al. OpenRank leaderboard: Motivating open source collaborations through social network evaluation in Alibaba [C]// Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice. 2024: 346-357.
30	BREIMAN L. Random forests [J]. Machine Learning, 2001, 45: 5-32.
31	FRIEDMAN J H. Stochastic gradient boosting [J]. Computational Statistics & Data Analysis, 2002, 38(4): 367-378.
32	CHEN T Q, GUESTRIN C. XGBoost: A scalable tree boosting system [C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016: 785-794.
33	HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780.
34	DOROGUSH A V, ERSHOV V, GULIN A. CatBoost: Gradient boosting with categorical features support [EB/OL]. (2018-10-24) [2025-01-10]. https://doi.org/10.48550/arXiv.1810.11363.
35	LaVALLEY M P. Logistic regression [J]. Circulation, 2008, 117(18): 2395-2399.
36	GEURTS P, ERNST D, WEHENKEL L. Extremely randomized trees [J]. Machine Learning, 2006, 63: 3-42.

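Prediction method	Correctly predicted users	Share of successfully predicted users (%)	Accuracy
TLD rule matching	3,011	6.02	0.7332
IP address geolocation	5,454	10.91	0.6257
Name + time zone cross-validation	21,320	42.64	0.7641
Multi-model ensemble validation	11,476	22.95	0.7557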
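
As a quick consistency check, the per-method shares in the table above can be recomputed from the user counts and summed to recover the overall coverage quoted in the abstract. The snippet assumes the denominator is the full 50,000-developer dataset, which is what the reported percentages imply; the variable names are illustrative.

```python
TOTAL_USERS = 50_000  # dataset size stated in the abstract

# Correctly predicted users per method, copied from the table above.
correct_users = {
    "TLD rule matching": 3011,
    "IP address geolocation": 5454,
    "Name + time zone cross-validation": 21320,
    "Multi-model ensemble validation": 11476,
}

# Share of the whole dataset resolved by each method, in percent.
shares = {
    method: round(100 * count / TOTAL_USERS, 2)
    for method, count in correct_users.items()
}

# Overall coverage across the four methods, in percent.
overall = round(100 * sum(correct_users.values()) / TOTAL_USERS, 2)

for method, share in shares.items():
    print(f"{method}: {share:.2f}%")
print(f"Overall: {overall:.2f}%")
```

The recomputed shares match the table (6.02, 10.91, 42.64, 22.95), and their total matches the 82.52% figure in the abstract.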

Model	Accuracy	Precision	Recall	F₁ score
XGBoost^[32]	0.7405	0.7151	0.7405	0.7276
LSTM^[33]	0.7355	0.7047	0.7355	0.7198
BiLSTM^[12]	0.7396	0.7109	0.7396	0.7250
Gradient Boosting^[31]	0.7292	0.7156	0.7292	0.7223
CatBoost^[34]	0.6998	0.6529	0.6998	0.6755
Random Forest^[30]	0.6900	0.6365	0.6900	0.6622
Logistic Regression^[35]	0.6792	0.6240	0.6792	0.6503
Extra Trees^[36]	0.6549	0.5921	0.6549	0.6218
Ensemble model	0.7557	0.7165	0.7557	0.7356
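
The F₁ column in the table above appears to be the harmonic mean of the reported precision and recall, F₁ = 2PR / (P + R). The source does not state the averaging scheme, so this is an assumption; the short check below recomputes F₁ for each row and reports the largest deviation from the published value.

```python
# (precision, recall, reported F1) per model, copied from the table above.
rows = {
    "XGBoost": (0.7151, 0.7405, 0.7276),
    "LSTM": (0.7047, 0.7355, 0.7198),
    "BiLSTM": (0.7109, 0.7396, 0.7250),
    "Gradient Boosting": (0.7156, 0.7292, 0.7223),
    "CatBoost": (0.6529, 0.6998, 0.6755),
    "Random Forest": (0.6365, 0.6900, 0.6622),
    "Logistic Regression": (0.6240, 0.6792, 0.6503),
    "Extra Trees": (0.5921, 0.6549, 0.6218),
    "Ensemble model": (0.7165, 0.7557, 0.7356),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Absolute gap between the recomputed and the reported F1 for each model.
gaps = {
    model: abs(f1_score(p, r) - reported)
    for model, (p, r, reported) in rows.items()
}
max_gap = max(gaps.values())

for model, gap in gaps.items():
    print(f"{model}: gap = {gap:.5f}")
print(f"largest gap = {max_gap:.5f}")
```

Every row agrees with the published F₁ to within about 2e-4, which is consistent with the inputs having been rounded to four decimals.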

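Feature configuration	Accuracy	Precision	Recall	F₁ score
All features	0.7557	0.7165	0.7557	0.7356
Without temporal behavior features	0.6463	0.5927	0.6463	0.6183
Without linguistic culture features	0.5292	0.5156	0.5292	0.5223
Without network behavior features	0.4388	0.4091	0.4388	0.4234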
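
One way to read the ablation table above is as the accuracy lost when each feature group is removed. The sketch below computes those drops and ranks the groups; treating the drop as a proxy for a group's contribution is an interpretive assumption, since feature groups can interact.

```python
BASELINE_ACCURACY = 0.7557  # all features, from the table above

# Accuracy after removing each feature group, from the table above.
ablated_accuracy = {
    "temporal behavior": 0.6463,
    "linguistic culture": 0.5292,
    "network behavior": 0.4388,
}

# Absolute accuracy drop attributable to removing each group.
drops = {
    group: round(BASELINE_ACCURACY - acc, 4)
    for group, acc in ablated_accuracy.items()
}

# Feature groups ordered from largest to smallest drop.
ranking = sorted(drops, key=drops.get, reverse=True)

for group in ranking:
    print(f"{group}: -{drops[group]:.4f}")
```

Under this reading, removing the network behavior features costs the most accuracy, followed by linguistic culture and then temporal behavior.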

Country	Accuracy
Brazil	0.3733
Russia	0.3471
Japan	0.2781
India	0.2698
China	0.2355
Indonesia	0.1712
Germany	0.1676
Canada	0.1132
United Kingdom	0.0812
United States	0.0521
