Journal of East China Normal University (Natural Science) ›› 2025, Vol. 2025 ›› Issue (6): 46-52. doi: 10.3969/j.issn.1000-5641.2025.06.006

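Video question answering for the development of theory of mind

Yuanyuan MAO1, Xin LIN1,*, Qin NI2, Ciping DENG3, Yiming MA1

    1. School of Computer Science and Technology, East China Normal University, Shanghai 200062
    2. School of International Education, Shanghai International Studies University, Shanghai 201620
    3. School of Psychology and Cognitive Science, East China Normal University, Shanghai 200062
  • Received: 2024-02-19 Online: 2025-11-25 Published: 2025-11-29
  • Contact: Xin LIN E-mail: xlin@cs.ecnu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (2021ZD0111000, 2021ZD0111004); Science and Technology Commission of Shanghai Municipality projects (21511100101, 22511105901, 22DZ2229004)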

Research on video question answer for the development of theory of mind

Yuanyuan MAO1, Xin LIN1,*, Qin NI2, Ciping DENG3, Yiming MA1

    1. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
    2. School of International Education, Shanghai International Studies University, Shanghai 201620, China
    3. School of Psychology and Cognitive Science, East China Normal University, Shanghai 200062, China
  • Received:2024-02-19 Online:2025-11-25 Published:2025-11-29
  • Contact: Xin LIN E-mail:xlin@cs.ecnu.edu.cn

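Abstract:

With the continued development of machine theory of mind (ToM) in recent years, research has found that the development of machine ToM differs considerably from the triangular model of children's ToM development. We therefore propose a triangular model for the development of machine ToM, which describes the relationships among the tools involved in the machine ToM process. Following this triangular model, we propose an evaluation dataset suited to the development of machine ToM, which can be used for the dynamic assessment of machine ToM. Finally, we design FOMemNet, a video question answering model dedicated to cognitive reasoning, which mainly addresses belief, desire, and intention reasoning. In cognitive reasoning tasks, the model must reason from the observer's perspective; FOMemNet obtains visual fact features and observer features through its visual fact and observer perception encoding module. In addition, the model uses a multi-perspective fusion module and two memory modules to fuse the features from the two perspectives into a global representation. FOMemNet improves accuracy on the BDIQA (Belief, Desire and Intention Question Answer) dataset by 2.27%. Experiments show that the concepts of fact and observer effectively improve the cognitive reasoning ability of video question answering.

Key words: artificial intelligence, machine cognition evaluation, multimodality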

Abstract:

In recent years, as machine theory of mind (ToM) has continued to develop, research has found that the development of machine ToM differs significantly from the triangular model of children’s ToM development. We therefore propose a triangular model for the development of machine ToM, which describes the relationships among the tools involved in that process. Following this model, we introduce an evaluation dataset suited to the dynamic assessment of machine ToM. Finally, we design a video question answering (VideoQA) model, FOMemNet (fact and observer memory network), tailored to cognitive reasoning, specifically belief, desire, and intention reasoning. Because cognitive reasoning tasks require a model to infer from the observer’s perspective, FOMemNet includes the FOEM (vision fact and observer perception encoder module), which encodes multimodal input into visual fact features and observer features. The model then uses the FOF (fact and observer fusion) module and two memory modules to fuse the features from the two perspectives into a global representation. FOMemNet improves accuracy on the BDIQA (belief, desire, and intention question answer) dataset by 2.27%. Our experiments show that the concepts of fact and observer perception effectively enhance cognitive reasoning in VideoQA.

Key words: artificial intelligence, machine cognition evaluation, multimodality

CLC number: