Research on video question answer for the development of theory of mind

doi:10.3969/j.issn.1000-5641.2025.06.006

Abstract

In recent years, as research on machine theory of mind (ToM) has advanced, studies have found that the development of machine ToM differs significantly from the triangular model of children's ToM development. We therefore propose a machine-oriented triangular model of ToM, which elucidates the relationships among the various tools involved in developing machine ToM. We also introduce an evaluation dataset suited to the dynamic assessment of machine ToM. Finally, we design a video question answering (VideoQA) model, named FOMemNet (fact and observer memory network), tailored to cognitive reasoning, that is, reasoning about belief, desire, and intention. Because models in cognitive reasoning tasks need to infer from the observer's perspective, FOMemNet incorporates FOEM (vision fact and observer perception encoder module) to fuse multimodal features and obtain visual factual features and observer features. The model then uses the FOF (fact and observer fusion) module and two memory modules to integrate features from both perspectives into a global representation. FOMemNet improves accuracy on BDIQA by 2.27 percentage points. Our experiments demonstrate that the concept of fact and observer perception effectively enhances cognitive reasoning in VideoQA.

Key words: artificial intelligence, machine cognition evaluation, multimodality

CLC Number: TP184

Yuanyuan MAO, Xin LIN, Qin NI, Ciping DENG, Yiming MA. Research on video question answer for the development of theory of mind[J]. Journal of East China Normal University (Natural Science), 2025, 2025(6): 46-52.

References

1	PREMACK D, WOODRUFF G. Does the chimpanzee have a theory of mind? [J]. Behavioral and Brain Sciences, 1978 (4): 515-526.
2	SHU T, BHANDWALDAR A, GAN C, et al. AGENT: A benchmark for core psychological reasoning [C]// International Conference on Machine Learning. 2021: 9614-9625.
3	BISWAS-DIENER R, DIENER E. Theory of mind [EB/OL]. (2021-09-13) [2024-01-02]. https://nobaproject.com/modules/theory-of-mind.
4	WIMMER H, PERNER J. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children’s understanding of deception [J]. Cognition, 1983, 13 (1): 103-128.
5	KIM J, MA M, KIM K, et al. Gaining extra supervision via multi-task learning for multi-modal video question answering [C]// 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019: 1-8.
6	WANG A R, LUU A T, FOO C S, et al. Holistic multi-modal memory network for movie question answering [J]. IEEE Transactions on Image Processing, 2019, 29: 489-499.
7	GAO J Y, GE R Z, CHEN K, et al. Motion-appearance co-memory networks for video question answering [C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018: 6576-6585.
8	GARCIA N, NAKASHIMA Y. Knowledge-based video question answering with unsupervised scene descriptions [C]// Computer Vision – ECCV 2020. 2020: 581-598.
9	WANG J Y, BAO B K, XU C S. DualVGR: A dual-visual graph reasoning unit for video question answering [J]. IEEE Transactions on Multimedia, 2021, 24: 3369-3380.
10	YANG A, MIECH A, SIVIC J, et al. Zero-shot video question answering via frozen bidirectional language models [C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. ACM, 2022: 124-141.
11	PUIG X, SHU T, LI S, et al. Watch-and-help: A challenge for social perception and human-ai collaboration [EB/OL]. (2021-05-03)[2024-01-05]. https://arxiv.org/pdf/2010.09890.
12	MAO Y, LIN X, NI Q, et al. BDIQA: a new dataset for video question answering to explore cognitive reasoning through theory of mind [C]// Proceedings of the AAAI Conference on Artificial Intelligence. AAAI, 2024: 583-591.
13	BAE W, YOO J, YE J C. Beyond deep residual learning for image restoration: Persistent homology-guided manifold simplification [C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2017: 1141-1149.
14	YANG Z K, GARCIA N, CHU C H, et al. BERT representations for video question answering [C]// 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2020: 1556-1565.
15	LE T M, LE V, VENKATESH S, et al. Hierarchical conditional relation networks for video question answering [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 9972-9981.
16	FAN C Y, ZHANG X F, ZHANG S, et al. Heterogeneous memory enhanced multimodal attention model for video question answering [C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019: 1999-2007.
17	JIANG P, HAN Y H. Reasoning with heterogeneous graph alignment for video question answering [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34 (7): 11109-11116.
18	CHEN G Y, LIU X, WANG G R, et al. Tem-adapter: Adapting image-text pretraining for video question answer [C]// 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2023: 13945-13955.

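Basic concept	Level	Goal	Example
Desire	1	Recognize that humans act according to their desires	A person enters the kitchen to cook and ends up eating the food
Desire	2	Recognize that desires may go unsatisfied	A person enters the kitchen to cook but ends up without any food
Intention	1	Recognize that humans act according to their desires and make plans to satisfy them	A person wants to eat, so they enter the kitchen, take food from the refrigerator, and then cook it in the microwave
Intention	2	Recognize that different people make different plans to fulfil their desires	A person wants to eat; they might heat food in the microwave, heat it on the stove, or simply eat a snack
Belief	1	Infer a person's true beliefs	Where does Xiao Ming think the book is? (consistent with the facts)
Belief	2	Recognize that a person may hold false beliefs	Where does Xiao Ming think the book is? (inconsistent with the facts)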

Model	Overall accuracy/%	Cognitive accuracy/%	Perceptual accuracy/%	Desire questions, overall accuracy/%	Intention questions, overall accuracy/%	Belief questions, overall accuracy/%	True/false questions, overall accuracy/%
Comem [7]	69.71	70.62	65.67	71.39	67.28	72.28	71.05
HCRN [15]	67.59	66.37	65.67	65.44	62.50	69.95	72.25
HME [16]	61.86	59.84	55.31	67.42	54.78	56.48	72.49
Dual [9]	66.93	68.25	63.22	65.16	67.65	71.50	66.99
HGA [17]	57.80	52.62	63.49	28.33	60.29	69.43	62.84
Temp [18]	67.48	66.77	64.66	76.78	48.16	70.89	72.01
Frozen [10]	69.69	67.36	64.03	83.00	42.28	70.73	80.62
FOMemNet	71.98	72.98	67.03	76.77	68.01	73.06	73.92
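
The 2.27 figure quoted in the abstract appears to correspond to the gap in overall accuracy between FOMemNet and the strongest baseline in the table above. The following minimal sketch checks that arithmetic. The numbers are copied from the table; the dictionary and function names are illustrative assumptions, not part of the authors' code.

```python
# Overall accuracy (%) on BDIQA, copied from the results table.
# The variable and function names are hypothetical, chosen for illustration.
overall_accuracy = {
    "Comem": 69.71,
    "HCRN": 67.59,
    "HME": 61.86,
    "Dual": 66.93,
    "HGA": 57.80,
    "Temp": 67.48,
    "Frozen": 69.69,
    "FOMemNet": 71.98,
}


def margin_over_best_baseline(scores, proposed):
    """Return the strongest baseline and the proposed model's margin over it."""
    baselines = {name: acc for name, acc in scores.items() if name != proposed}
    best_name = max(baselines, key=baselines.get)
    # Round to two decimals to match the precision reported in the table.
    margin = round(scores[proposed] - baselines[best_name], 2)
    return best_name, margin


best_name, margin = margin_over_best_baseline(overall_accuracy, "FOMemNet")
print(f"Best baseline: {best_name}; FOMemNet margin: {margin:.2f} points")
```

Under this reading, the improvement is a difference in percentage points of overall accuracy rather than a relative gain.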

Ablation	Overall accuracy/%	Cognitive accuracy/%	Perceptual accuracy/%	Desire questions, overall accuracy/%	Intention questions, overall accuracy/%	Belief questions, overall accuracy/%	True/false questions, overall accuracy/%
FOMemNet	71.98	72.98	67.03	76.77	68.01	73.06	73.92
w/o FOF	70.99	71.91	65.12	76.20	66.18	72.02	73.92
w/o FOEM	70.43	70.72	65.94	70.82	67.65	72.80	73.68
w/o ITM	71.21	71.91	67.85	75.07	68.01	71.76	72.49
Baseline	69.71	70.62	65.67	71.39	67.28	72.28	71.05
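
One way to read the ablation table is to compute how much overall accuracy each variant loses relative to the full model. The sketch below does this with the overall-accuracy column only. The values are copied from the table; the names `ablation_overall` and `accuracy_drops` are hypothetical.

```python
# Overall accuracy (%) for the full model and each ablated variant,
# copied from the ablation table. Names are illustrative assumptions.
ablation_overall = {
    "FOMemNet": 71.98,
    "w/o FOF": 70.99,
    "w/o FOEM": 70.43,
    "w/o ITM": 71.21,
    "Baseline": 69.71,
}


def accuracy_drops(scores, full_model):
    """Return each variant's drop in accuracy relative to the full model."""
    full = scores[full_model]
    return {
        name: round(full - acc, 2)
        for name, acc in scores.items()
        if name != full_model
    }


drops = accuracy_drops(ablation_overall, "FOMemNet")
# List the variants from the largest drop to the smallest.
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: -{drop:.2f} points")
```

On overall accuracy alone, removing FOEM costs more than removing FOF or ITM. The other columns do not all move in the same direction (for example, perceptual accuracy is slightly higher without ITM), so this ranking applies to the overall metric only.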
