Journal of East China Normal University (Natural Science) ›› 2025, Vol. 2025 ›› Issue (5): 43-52. doi: 10.3969/j.issn.1000-5641.2025.05.005

• AI-Enabled Open-Source Technology and Applications •

ATBench: Benchmark for evaluating analysis trajectories in end-to-end data analysis

Xufei WANG1,2, Huarong XU1,2, Panfeng CHEN1,2, Mei CHEN1,2, Dan MA1,2,*, Zhengxi CHEN1,2, Xu TIAN1,2, Hui LI1,2

  1. State Key Laboratory of Public Big Data, Guiyang 550025, China
  2. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  • Received: 2025-06-27 Online: 2025-09-25 Published: 2025-09-25
  • Contact: Dan MA E-mail: dma@gzu.edu.cn
  • Funding:
    National Natural Science Foundation of China (62162010, 72161005); National Key Research and Development Program of China (2023YFC3341202, 2023YFC3341205)

Abstract:

This paper introduces ATBench, a benchmark for evaluating analysis trajectories in end-to-end data analysis tasks, to address the limited granularity and domain coverage of existing benchmarks. An analysis trajectory is the analysis chain in which an agent, working toward an analysis goal, repeatedly poses questions and generates insights over multiple rounds of interaction and finally produces a summary. Drawing on existing benchmarks and real task data from the Kaggle platform, and using an annotation strategy that combines goal-driven and exploration-driven approaches, we constructed 151 evaluation datasets covering eight domains. We also propose a fine-grained evaluation metric, the analysis trajectory score $T_{\mathrm{score}}$, which quantifies an agent's ability to analyze coherently while carrying out an end-to-end data analysis task. Experimental results show that ATBench has high stability and discriminative power and reliably distinguishes performance differences among models on end-to-end data analysis tasks. The benchmark also reveals shortcomings in agents' coherent analysis and insight discovery, providing data to support future agent optimization.

Key words: agent, data analysis, benchmark

CLC number: