J* E* C* N* U* N* S* ›› 2025, Vol. 2025 ›› Issue (6): 46-52. doi: 10.3969/j.issn.1000-5641.2025.06.006


Research on video question answer for the development of theory of mind

Yuanyuan MAO1, Xin LIN1,*, Qin NI2, Ciping DENG3, Yiming MA1

  1. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
  2. School of International Education, Shanghai International Studies University, Shanghai 201620, China
  3. School of Psychology and Cognitive Science, East China Normal University, Shanghai 200062, China
  • Received: 2024-02-19 Online: 2025-11-25 Published: 2025-11-29
  • Contact: Xin LIN E-mail: xlin@cs.ecnu.edu.cn

Abstract:

In recent years, as research on machine theory of mind (ToM) has advanced, studies have found that the development of machine ToM differs significantly from the triangular model of children’s ToM development. We therefore propose a machine-oriented triangular model of theory of mind, which elucidates the relationships among the various tools involved in developing machine ToM. We also introduce an evaluation dataset suited to the dynamic assessment of machine ToM. Finally, we design a video question answering (VideoQA) model, named FOMemNet (fact and observer memory network), tailored to cognitive reasoning, that is, reasoning about beliefs, desires, and intentions. Because models in cognitive reasoning tasks must infer from the observer’s perspective, FOMemNet incorporates FOEM (vision fact and observer perception encoder module) to fuse multimodal features and obtain visual factual features and observer features. The model then uses the FOF (fact and observer fusion) module and two memory modules to integrate features from both perspectives into a global representation. FOMemNet achieves a 2.27% improvement on BDIQA. Our experiments demonstrate that the concept of fact and observer perception is effective in enhancing cognitive reasoning abilities in VideoQA.

Key words: artificial intelligence, machine cognition evaluation, multimodality
