ATBench: Benchmark for evaluating analysis trajectories in end-to-end data analysis

doi:10.3969/j.issn.1000-5641.2025.05.005

Journal of East China Normal University (Natural Science), 2025, Vol. 2025, Issue (5): 43-52.

• AI-Empowered Open-Source Technologies and Applications •

Xufei WANG¹,², Huarong XU¹,², Panfeng CHEN¹,², Mei CHEN¹,², Dan MA¹,²,*, Zhengxi CHEN¹,², Xu TIAN¹,², Hui LI¹,²

1. State Key Laboratory of Public Big Data, Guiyang 550025, China
2. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China

Received: 2025-06-27 Online: 2025-09-25 Published: 2025-09-25
Contact: Dan MA E-mail: dma@gzu.edu.cn
Supported by: National Natural Science Foundation of China (62162010, 72161005); National Key Research and Development Program of China (2023YFC3341202, 2023YFC3341205)

Abstract:

This paper introduces ATBench, a benchmark for evaluating analysis trajectories in end-to-end data analysis tasks, which addresses the limited granularity and domain coverage of existing benchmarks. An analysis trajectory is the analysis chain in which an agent, working toward a specific analysis goal, repeatedly poses questions and derives insights over multiple rounds of interaction and finally produces a summary. Combining existing benchmarks with real task data from Kaggle, and using an annotation strategy that mixes goal-driven and exploration-driven approaches, we constructed 151 evaluation datasets spanning eight domains. We also propose a fine-grained evaluation metric, the analysis trajectory score $T_{\mathrm{score}}$, which quantifies an agent's capacity for coherent analysis while it carries out an end-to-end data analysis task. Experimental results show that ATBench is stable and discriminative, and reliably distinguishes the performance of different models on end-to-end data analysis tasks. The benchmark also reveals weaknesses in agents' coherent analysis and insight discovery, providing data to support future agent improvements.
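The notion of an analysis trajectory described in the abstract (a goal, a sequence of question and insight rounds, and a final summary) can be pictured as a small data structure. The following is a minimal sketch only: the class names `Step` and `AnalysisTrajectory`, the `is_complete` check, and the example content are all hypothetical illustrations, not ATBench's actual schema, data, or scoring code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One round of interaction: a question and the insight derived from it."""
    question: str
    insight: str


@dataclass
class AnalysisTrajectory:
    """Hypothetical container for an analysis chain (not ATBench's schema)."""
    goal: str
    steps: List[Step] = field(default_factory=list)
    summary: str = ""

    def add_step(self, question: str, insight: str) -> None:
        """Append one question/insight round to the chain."""
        self.steps.append(Step(question, insight))

    def is_complete(self) -> bool:
        """Assumed rule: at least one step and a non-empty final summary."""
        return bool(self.steps) and bool(self.summary.strip())


# Made-up example content, for illustration only.
trajectory = AnalysisTrajectory(goal="Explain the drop in monthly sales")
trajectory.add_step("Which region declined most?", "Region A fell sharply in Q3.")
trajectory.add_step("What changed in Region A?", "A key product was discontinued.")
trajectory.summary = "The decline is concentrated in Region A after a product was discontinued."
```

Keeping the steps ordered matters in this sketch because a trajectory-level metric would presumably judge how each question follows from earlier insights, not only the quality of individual insights or of the summary.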

Key words: agent, data analysis, benchmark

CLC Number: TP181

Xufei WANG, Huarong XU, Panfeng CHEN, Mei CHEN, Dan MA, Zhengxi CHEN, Xu TIAN, Hui LI. ATBench: Benchmark for evaluating analysis trajectories in end-to-end data analysis[J]. Journal of East China Normal University (Natural Science), 2025, 2025(5): 43-52.

Figures and tables (6): Figures 1-3; Tables 1-3.


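Benchmark	Task type	Evaluation metrics
QRData	Single-turn analysis	Accuracy
Text2Analysis	Single-turn analysis	Pass rate; regression score; executable code ratio
DataSciBench	Single-turn analysis	Completion rate; success rate; multi-task weighted average score
InfiAgent-DABench	Single-turn analysis	Accuracy
DSBench	Single-turn analysis	Task-level accuracy; task success rate; relative performance gap
InsightBench	Multi-turn analysis	Insight score; summary score
ATBench (ours)	Multi-turn analysis	Insight score; summary score; analysis trajectory score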

Agent	Base model	$I_{\mathrm{score}}$	$S_{\mathrm{score}}$	$T_{\mathrm{score}}$
AutoGen	Qwen2.5-32B-Instruct	0.2958	0.2775	0.2651
AutoGen	Qwen2.5-72B-Instruct	0.3648	0.4200	0.3250
AutoGen	DeepSeek-V3	0.3783	0.4086	0.3476
AutoGen	GPT-4.1-Mini	0.4032	0.4252	0.3707
AgentPoirot	Qwen2.5-32B-Instruct	0.3740	0.4212	0.3336
AgentPoirot	Qwen2.5-72B-Instruct	0.3968	0.4384	0.3599
AgentPoirot	DeepSeek-V3	0.3915	0.4344	0.3548
AgentPoirot	GPT-4.1-Mini	0.4454	0.4735	0.4113

Agent	Base model	$T_{\mathrm{score}}$ (mean ± standard deviation)
AutoGen	Qwen2.5-32B-Instruct	0.2681 ± 0.0038
AutoGen	Qwen2.5-72B-Instruct	0.3284 ± 0.0038
AutoGen	DeepSeek-V3	0.3479 ± 0.0011
AutoGen	GPT-4.1-Mini	0.3721 ± 0.0017
AgentPoirot	Qwen2.5-32B-Instruct	0.3374 ± 0.0036
AgentPoirot	Qwen2.5-72B-Instruct	0.3663 ± 0.0043
AgentPoirot	DeepSeek-V3	0.3565 ± 0.0051
AgentPoirot	GPT-4.1-Mini	0.4140 ± 0.0037
