Journal of East China Normal University (Natural Science) ›› 2025, Vol. 2025 ›› Issue (5): 43-52. doi: 10.3969/j.issn.1000-5641.2025.05.005

• AI-Enabled Open Source Technologies and Applications •

ATBench: Benchmark for evaluating analysis trajectories in end-to-end data analysis

Xufei WANG1,2, Huarong XU1,2, Panfeng CHEN1,2, Mei CHEN1,2, Dan MA1,2,*(), Zhengxi CHEN1,2, Xu TIAN1,2, Hui LI1,2   

  1. State Key Laboratory of Public Big Data, Guiyang 550025, China
    2. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  • Received: 2025-06-27 Online: 2025-09-25 Published: 2025-09-25
  • Contact: Dan MA E-mail: dma@gzu.edu.cn

Abstract:

This paper introduces ATBench, a benchmark for evaluating analysis trajectories in end-to-end data analysis tasks, addressing the limited granularity and domain coverage of existing benchmarks. An analysis trajectory is the process by which an agent, through iterative interactions, poses questions, derives insights, and formulates conclusions around a specific analysis goal. Drawing on both existing benchmarks and real Kaggle tasks, we construct 151 evaluation datasets spanning eight distinct domains, using an annotation strategy that balances goal-driven and exploratory approaches. We further propose a fine-grained evaluation metric, the analysis trajectory score, to assess an agent's coherent analytical capability across an end-to-end data analysis task. Experimental results show that ATBench is stable and discriminative, effectively distinguishing performance differences among models on analytical tasks. The results also expose agents' limitations in coherent analysis and insight discovery, providing data-driven support for future improvements.
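The notion of an analysis trajectory described above can be sketched as a simple data structure. Note that the step fields and the mean-based aggregation below are illustrative assumptions for exposition only; they are not the paper's actual definition of the analysis trajectory score.

```python
from dataclasses import dataclass, field

@dataclass
class AnalysisStep:
    """One question-to-insight iteration in a trajectory (hypothetical fields)."""
    question: str
    insight: str
    quality: float  # hypothetical per-step quality in [0, 1]

@dataclass
class AnalysisTrajectory:
    """A sequence of steps an agent takes toward one analysis goal."""
    goal: str
    steps: list[AnalysisStep] = field(default_factory=list)

    def trajectory_score(self) -> float:
        """Toy aggregate: mean of per-step quality (a placeholder, not ATBench's metric)."""
        if not self.steps:
            return 0.0
        return sum(s.quality for s in self.steps) / len(self.steps)

# Example: a two-step trajectory toward a churn-analysis goal.
traj = AnalysisTrajectory(goal="Explain customer churn drivers")
traj.steps.append(AnalysisStep("Which segment churns most?", "Segment B, 32%", 0.8))
traj.steps.append(AnalysisStep("Does tenure correlate with churn?", "Yes, r = -0.41", 0.6))
print(round(traj.trajectory_score(), 2))  # prints 0.7
```

A fine-grained, per-step representation like this is what lets a metric credit coherence across the whole trajectory rather than only the final conclusion.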

Key words: agent, data analysis, benchmark
