Journal of East China Normal University (Natural Science), 2024, Vol. 2024, Issue (5): 93-103. doi: 10.3969/j.issn.1000-5641.2024.05.009

• Educational Knowledge Graphs and Large Language Models •

Prompting open-source code large language models for student program repair

Zhirui CHEN, Xuesong LU*

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  • Received: 2024-07-09  Accepted: 2024-08-01  Online: 2024-09-25  Published: 2024-09-23
  • Contact: Xuesong LU  E-mail: xslu@dase.ecnu.edu.cn
  • Supported by: National Natural Science Foundation of China (62277017)

Abstract:

Advances in machine-learning technology have enabled automated program-repair techniques that learn how humans fix erroneous code, thereby helping students debug their programs and improving the efficiency of their self-directed learning. Automatic program-repair models have traditionally been based on either manually designed symbolic rules or data-driven methods. With the emergence of large language models that possess strong natural-language understanding and code-generation capabilities, researchers have attempted to use prompt engineering for automatic program repair. However, existing studies primarily evaluate commercial models such as Codex and GPT-4, which are costly to use at scale and raise data-privacy concerns in educational settings. Furthermore, these studies mostly employ simple prompt forms to assess the program-repair capabilities of the models and lack an in-depth analysis of the results. To address these limitations, we evaluate two representative open-source code large language models via prompt engineering, test different prompting methods such as chain-of-thought and few-shot learning, and analyze the results in depth. Finally, we offer suggestions for integrating large language models into programming-education scenarios.
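The kind of prompting workflow summarized above can be pictured with a short sketch. The snippet below is a minimal, illustrative example (not the authors' exact setup) of building a few-shot, chain-of-thought prompt for student program repair and sending it to an open-source code model through the Hugging Face transformers API; the model name codellama/CodeLlama-7b-Instruct-hf, the prompt wording, and the bug/fix pair are assumptions made for illustration only, since the abstract does not specify them.

```python
# Minimal sketch of few-shot + chain-of-thought prompting for student program repair.
# Assumptions (not from the paper): the model name, the prompt wording, and the
# example bug/fix pair below are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "codellama/CodeLlama-7b-Instruct-hf"  # assumed open-source code LLM

# One illustrative few-shot example: a buggy student submission, a short reasoning
# trace (chain of thought), and the repaired code.
FEW_SHOT_EXAMPLE = """Problem: Return the sum of a list of integers.
Buggy code:
def list_sum(xs):
    total = 0
    for x in xs:
        total = x
    return total
Reasoning: The loop assigns each element to total instead of adding it, so only the
last element is returned. Accumulate with += instead.
Fixed code:
def list_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total
"""


def build_prompt(problem: str, buggy_code: str) -> str:
    """Compose a few-shot, chain-of-thought prompt asking the model to repair student code."""
    return (
        "You are a programming tutor. Repair the student's buggy code.\n"
        "First explain the bug step by step, then give the fixed code.\n\n"
        + FEW_SHOT_EXAMPLE
        + f"\nProblem: {problem}\nBuggy code:\n{buggy_code}\nReasoning:"
    )


if __name__ == "__main__":
    prompt = build_prompt(
        problem="Return the largest value in a non-empty list of integers.",
        buggy_code=(
            "def list_max(xs):\n"
            "    m = 0\n"            # bug: fails for all-negative lists
            "    for x in xs:\n"
            "        if x > m:\n"
            "            m = x\n"
            "    return m\n"
        ),
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

In practice one would vary the number of few-shot examples and the presence of the reasoning instruction to compare plain, few-shot, and chain-of-thought prompts, which is the kind of comparison the abstract describes.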

Key words: automatic program repair, large language models, prompt engineering
