Journal of East China Normal University (Natural Science) ›› 2024, Vol. 2024 ›› Issue (5): 93-103. DOI: 10.3969/j.issn.1000-5641.2024.05.009

• Educational Knowledge Graphs and Large Language Models •

Prompting open-source code large language models for student program repair

Zhirui CHEN, Xuesong LU*

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  • Received: 2024-07-09 Accepted: 2024-08-01 Online: 2024-09-25 Published: 2024-09-23
  • Contact: Xuesong LU E-mail: xslu@dase.ecnu.edu.cn

Abstract:

Advances in machine-learning technology have enabled automated program-repair techniques that learn how humans fix erroneous code, thereby helping students debug their programs and improving the efficiency of their self-directed learning. Automatic program-repair models are typically built on either manually designed symbolic rules or data-driven methods. Owing to the availability of large language models with excellent natural-language understanding and code-generation capabilities, researchers have attempted to use prompt engineering for automatic program repair. However, existing studies primarily evaluate commercial models such as Codex and GPT-4, which may incur high costs for large-scale adoption and raise data-privacy concerns in educational scenarios. Furthermore, these studies typically employ simple prompt forms to assess the program-repair capabilities of large language models and do not analyze the results comprehensively. Hence, we evaluate two representative open-source code large language models with excellent code-generation capability using prompt engineering. We compare different prompting methods, such as chain-of-thought and few-shot learning, and analyze the results comprehensively. Finally, we provide suggestions for integrating large language models into programming-education scenarios.
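For illustration only, the following is a minimal sketch of the kind of few-shot repair prompt the abstract refers to, assuming a Hugging Face transformers causal-LM interface. The model name, exemplar, and buggy program are hypothetical placeholders and are not the specific open-source models or benchmark programs evaluated in the paper.

# Minimal sketch: few-shot prompting an open-source code LLM for student program repair.
# MODEL_NAME, the exemplar, and the buggy program are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-open-source-code-llm"  # placeholder for any Hugging Face causal code model

# One worked exemplar (few-shot) followed by the student's buggy program.
FEW_SHOT_PROMPT = """You are a programming tutor. Fix the student's buggy program.

### Example
Buggy program:
def average(xs):
    return sum(xs) / len(xs) + 1   # off-by-one bug

Fixed program:
def average(xs):
    return sum(xs) / len(xs)

### Task
Buggy program:
def count_positive(xs):
    count = 0
    for x in xs:
        if x > 0:
            count = 1              # should accumulate, not overwrite

Fixed program:
"""

def repair(prompt: str, max_new_tokens: int = 256) -> str:
    """Generate a candidate repair by greedy decoding."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens (the candidate fixed program).
    generated = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)

if __name__ == "__main__":
    print(repair(FEW_SHOT_PROMPT))

A chain-of-thought variant would additionally instruct the model to first describe the bug in natural language before emitting the fixed program; the paper compares such prompting strategies and analyzes their repair results.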

Key words: automatic program repair, large language models, prompt engineering
