Research on challenges and optimization of large multimodal model applications in treefall scenarios

doi:10.3969/j.issn.1000-5641.2025.05.006

J* E* C* N* U* N* S*, 2025, Vol. 2025, Issue (5): 53-65.

AI-Enabled Open Source Technologies and Applications

Lei FENG¹, Chaonan LI¹, Chunjie SHENG²,*, Yuxing SHI², Yicheng HUANG¹, Jianhong JIN¹, Yun XU¹, Yuzhou DU¹, Nina ZHOU¹, Sihao MIAO¹

1. Hangzhou OpenPie Technology Development Co. Ltd., Hangzhou 310000, China
2. The Government Affairs Service Management Office of Pinghu City, Pinghu, Zhejiang 314200, China

Received: 2025-01-15; Accepted: 2025-08-05; Online: 2025-09-25; Published: 2025-09-25
Contact: Chunjie SHENG. E-mail: ray.von@openpie.com; 214539069@qq.com

Abstract

Large multimodal models (LMMs) show limited robustness in complex visual scenarios, such as identifying responsibility for fallen trees, because they rely on single-path reasoning. To address this limitation, this study proposes a reasoning optimization method based on Beam Search Chain-of-Thought (BS-CoT). Conventional models often fall into a “first-impression” trap, in which an incorrect initial inference leads to an irreversible analytical failure. BS-CoT counteracts this by exploring and evaluating multiple candidate inference paths in parallel: it maintains a diverse set of hypotheses about the scene and continuously prunes the less likely ones, so the model does not commit early to a single, fallacious line of reasoning. This improves visual decision-making in complex and noisy environments. To validate the method, we constructed a specialized dataset covering a wide range of treefall incidents in urban governance. Experimental results show that the proposed method achieves substantial improvements over baseline models in both event recall and key-information capture rate. The work provides a reliable technical solution for visual decision-making in urban public safety and offers a more robust approach to improving the reasoning reliability of large models in critical applications.
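
The contrast between single-path reasoning and beam search over reasoning paths can be illustrated with a minimal sketch. It is not the authors' implementation: the `expand` callback, the toy reasoning tree, and every probability below are hypothetical stand-ins for calls to a multimodal model, and the abstract does not specify how BS-CoT generates or scores steps.

```python
import math
from typing import Callable, List, Tuple

Path = Tuple[str, ...]
Beam = Tuple[Path, float]  # (reasoning path, cumulative log-probability)


def beam_search_cot(
    expand: Callable[[Path], List[Tuple[str, float]]],
    beam_size: int,
    max_steps: int,
) -> List[Beam]:
    """Keep the `beam_size` best partial reasoning paths at every step."""
    beams: List[Beam] = [((), 0.0)]
    for _ in range(max_steps):
        candidates: List[Beam] = []
        for path, logp in beams:
            options = expand(path)
            if not options:  # finished path: carry it over unchanged
                candidates.append((path, logp))
                continue
            for step, prob in options:
                candidates.append((path + (step,), logp + math.log(prob)))
        # Prune: keep only the most likely hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


# Hypothetical toy reasoning tree; probabilities are made up for illustration.
TOY_TREE = {
    (): [("tree is standing", 0.6), ("tree is lying on road", 0.4)],
    ("tree is standing",): [("no incident", 0.55), ("branch hazard", 0.45)],
    ("tree is lying on road",): [
        ("treefall on public road", 0.9),
        ("construction debris", 0.1),
    ],
}


def toy_expand(path: Path) -> List[Tuple[str, float]]:
    return TOY_TREE.get(path, [])


greedy = beam_search_cot(toy_expand, beam_size=1, max_steps=2)
wide = beam_search_cot(toy_expand, beam_size=2, max_steps=2)
print("beam_size=1:", greedy[0][0])
print("beam_size=2:", wide[0][0])
```

In this toy tree, a beam of one commits to the locally most likely first step and can no longer reach the treefall conclusion, while a beam of two keeps the second hypothesis alive long enough for it to win on cumulative score.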

Key words: large multimodal model, social governance, AI agent

CLC Number: TP315

Lei FENG, Chaonan LI, Chunjie SHENG, Yuxing SHI, Yicheng HUANG, Jianhong JIN, Yun XU, Yuzhou DU, Nina ZHOU, Sihao MIAO. Research on challenges and optimization of large multimodal model applications in treefall scenarios[J]. J* E* C* N* U* N* S*, 2025, 2025(5): 53-65.

Figures/Tables (8): Fig. 1, Table 1, Fig. 2, Fig. 3, Fig. 4, Table 2, Table 3, Fig. 5

References 9

1	ZHONG T, LIU Z, PAN Y, et al. Evaluation of OpenAI o1: Opportunities and challenges of AGI [EB/OL]. (2024-09-27)[2025-01-10]. https://arxiv.org/abs/2409.18486.
2	Meta. Llama-3.2-11B-Vision-instruct [EB/OL]. (2024-09-25)[2025-01-10]. https://docs.api.nvidia.com/nim/reference/meta-llama-3_2-11b-vision-instruct.
3	WANG Y, CHEN W, HAN X, et al. Exploring the reasoning abilities of multimodal large language models (MLLMs): A comprehensive survey on emerging trends in multimodal reasoning [EB/OL]. (2024-01-10)[2025-01-10]. https://arxiv.org/abs/2401.06805.
4	MITRA C, HUANG B, DARRELL T, et al. Compositional Chain-of-Thought prompting for large multimodal models [EB/OL]. (2023-11-27)[2025-01-10]. https://arxiv.org/abs/2311.17076.
5	LI Z, LIU D, ZHANG C, et al. Enhancing advanced visual reasoning ability of large language models [EB/OL]. (2024-09-21)[2025-01-10]. https://arxiv.org/abs/2409.13980.
6	XU G, JIN P, LI H, et al. LLaVA-CoT: Let vision language models reason step-by-step [EB/OL]. (2024-11-15)[2025-01-10]. https://arxiv.org/abs/2411.10440.
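7	Dongguan Urban Management and Comprehensive Law Enforcement Bureau. Dongguan emergency response plan for public emergencies in landscaping and greening [EB/OL]. (2010-10-27)[2025-01-10]. https://dgcg.dg.gov.cn/gkmlpt/content/0/132/post_132435.html#489.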
8	MACQUEEN J B. Some methods for classification and analysis of multivariate observations [C]// Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. Berkeley: University of California Press, 1967: 281-297.
9	ZHANG T, KISHORE V, WU F, et al. BERTScore: Evaluating text generation with BERT [EB/OL]. (2019-04-21)[2025-01-10]. https://arxiv.org/abs/1904.09675.

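Tree category	Responsible party
Trees on public green space	Landscaping and greening department, or the competent authorities for urban roads, highways, or waterways
Trees on green space belonging to an organization	The owning organization
Trees within residential areas and residential communities	Property owners, who may entrust a property management company or other enterprise with maintenance
Trees retained within the scope of a construction project	The construction entity, for the duration of construction
Trees on green space within land used for railways, lakes, reservoirs, and similar purposes	The respective competent authorities
Trees on village green space	Villagers' committee or village collective economic organization
Privately planted trees	Owner or manager
Trees on collectively owned land	Generally the collective organization; trees planted without authorization belong to the planter, who must return any unjust enrichment to the collective and bear any resulting tort liability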
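
The category-to-responsibility table above can be encoded as a simple lookup. The sketch below is only one possible encoding: the enum names, the English wording, and the `responsible_party` helper are illustrative assumptions, and this page does not state whether or how the authors' pipeline uses such a mapping.

```python
from enum import Enum


class TreeCategory(Enum):
    # Hypothetical identifiers for the table's eight categories.
    PUBLIC_GREEN_SPACE = "public green space"
    ORGANIZATION = "organization-owned green space"
    RESIDENTIAL = "residential area"
    CONSTRUCTION_SITE = "retained within construction project"
    RAIL_LAKE_RESERVOIR = "railway, lake, or reservoir land"
    VILLAGE = "village green space"
    PRIVATE = "privately planted"
    COLLECTIVE_LAND = "collectively owned land"


# Responsible party per category, paraphrased from the table.
RESPONSIBLE_PARTY = {
    TreeCategory.PUBLIC_GREEN_SPACE: "landscaping department or road/highway/waterway authority",
    TreeCategory.ORGANIZATION: "owning organization",
    TreeCategory.RESIDENTIAL: "property owners (may delegate to property management)",
    TreeCategory.CONSTRUCTION_SITE: "construction entity during construction",
    TreeCategory.RAIL_LAKE_RESERVOIR: "respective competent authority",
    TreeCategory.VILLAGE: "villagers' committee or village collective economic organization",
    TreeCategory.PRIVATE: "owner or manager",
    TreeCategory.COLLECTIVE_LAND: "collective organization (planter, if planted without authorization)",
}


def responsible_party(category: TreeCategory) -> str:
    """Return the responsible party for a recognized tree category."""
    return RESPONSIBLE_PARTY[category]


print(responsible_party(TreeCategory.PRIVATE))
```

Keeping the mapping in data rather than in prompt text makes it easy to check that every category has exactly one entry.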

Item	Configuration
Processor (CPU)	Intel(R) Xeon(R) Gold 6426Y CPU @ 2.5 GHz
Graphics card (GPU)	NVIDIA RTX 4090 × 2 (48 GB)
Deep learning framework	PyTorch 2.7.1
Language version	Python 3.12

Model	BERTScore-P	BERTScore-R	BERTScore-F₁	SIM
LLaVA	0.8714	0.8332	0.8563	0.7268
LLaVA-CoT	0.9202	0.9073	0.9178	0.8597
BS-CoT(beam_size=2)	0.9231	0.9067	0.9212	0.9046
BS-CoT(beam_size=3)	0.9293	0.9105	0.9296	0.9121
BS-CoT(beam_size=4)	0.9312	0.9094	0.9283	0.9137
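
For readers unfamiliar with the BERTScore columns, the following sketch shows the greedy-matching idea behind precision, recall, and F₁ as defined in the BERTScore paper [9]. It is a simplified illustration, not the evaluation code used for the table: the 2-D vectors are made-up stand-ins for contextual token embeddings, inputs are assumed to be non-zero vectors, and idf weighting and baseline rescaling are omitted. The SIM column is not defined on this page and is not covered here.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(
    candidate: List[Vector], reference: List[Vector]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from greedy token matching."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(
        max(cosine(c, r) for r in reference) for c in candidate
    ) / len(candidate)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate) for r in reference
    ) / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy "token embeddings" (hypothetical values for illustration only).
candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]
reference_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f1 = greedy_match_scores(candidate_tokens, reference_tokens)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Here every candidate token has an exact counterpart in the reference, so precision is 1, while the third reference token is only partly covered, which lowers recall.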
