J* E* C* N* U* N* S* ›› 2025, Vol. 2025 ›› Issue (5): 53-65. doi: 10.3969/j.issn.1000-5641.2025.05.006

• AI-Enabled Open Source Technologies and Applications •

Research on challenges and optimization of large multimodal model applications in treefall scenarios

Lei FENG1, Chaonan LI1, Chunjie SHENG2,*, Yuxing SHI2, Yicheng HUANG1, Jianhong JIN1, Yun XU1, Yuzhou DU1, Nina ZHOU1, Sihao MIAO1

  1. Hangzhou OpenPie Technology Development Co. Ltd., Hangzhou 310000, China
  2. The Government Affairs Service Management Office of Pinghu City, Pinghu, Zhejiang 314200, China
  • Received: 2025-01-15 Accepted: 2025-08-05 Online: 2025-09-25 Published: 2025-09-25
  • Contact: Chunjie SHENG E-mail: ray.von@openpie.com; 214539069@qq.com

Abstract:

Large multimodal models (LMMs) show limited robustness in complex visual scenarios, such as identifying responsibility for fallen trees, because they rely on single-path reasoning. To address this limitation, this study proposes a reasoning optimization method based on Beam Search Chain-of-Thought (BS-CoT). Conventional models often fall into a “first-impression” trap, in which an incorrect initial inference leads to an irreversible analytical failure. BS-CoT counteracts this by exploring and evaluating multiple candidate inference paths in parallel: it maintains a diverse set of hypotheses about the scene and continuously prunes the less likely ones, which prevents the model from committing to a single, fallacious line of reasoning and improves visual decision-making in complex, noisy environments. To validate the method, we constructed a specialized dataset covering a wide range of treefall incidents in urban governance. Experimental results showed that the proposed method achieved substantial improvements over baseline models in both event recall and key-information capture rate. This research provides a reliable technical solution for visual decision-making in urban public safety and offers a more robust approach to improving the reasoning reliability of large models in critical applications.
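The pruning idea described in the abstract can be illustrated with a generic beam search over reasoning paths. This is a minimal sketch and not the authors' implementation: the abstract gives no algorithmic details, so the function beam_search_cot, the expand callback (a stand-in for a call to an LMM that proposes scored next steps), and the toy probabilities below are all assumptions made up for illustration.

```python
import heapq
import math
from typing import Callable, Iterable, List, Tuple

Path = Tuple[str, ...]  # one reasoning path = ordered tuple of step strings


def beam_search_cot(
    expand: Callable[[Path], Iterable[Tuple[str, float]]],
    beam_width: int,
    max_steps: int,
) -> List[Tuple[float, Path]]:
    """Generic beam search over reasoning paths (illustrative only).

    expand(path) returns (next_step, log_prob) candidates; an empty
    result marks the path as finished. Returns the surviving paths,
    best cumulative log-probability first.
    """
    beam: List[Tuple[float, Path]] = [(0.0, ())]
    for _ in range(max_steps):
        candidates: List[Tuple[float, Path]] = []
        extended = False
        for score, path in beam:
            steps = list(expand(path))
            if not steps:
                candidates.append((score, path))  # finished path is carried forward
                continue
            extended = True
            for step, logp in steps:
                candidates.append((score + logp, path + (step,)))
        if not extended:
            break  # every path in the beam is finished
        # Prune: keep only the beam_width most likely hypotheses.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return sorted(beam, key=lambda c: c[0], reverse=True)


# Hypothetical toy scene: the most likely first step is a dead end.
TOY = {
    (): [
        ("object is roadside debris", math.log(0.6)),
        ("object is a fallen tree", math.log(0.4)),
    ],
    ("object is roadside debris",): [
        ("no incident to report", math.log(0.55)),
        ("minor littering", math.log(0.45)),
    ],
    ("object is a fallen tree",): [
        ("tree blocks the road; report incident", math.log(0.9)),
        ("tree is off-road", math.log(0.1)),
    ],
}


def toy_expand(path: Path) -> List[Tuple[str, float]]:
    # Stand-in for a model call; unknown paths are treated as finished.
    return TOY.get(path, [])


greedy = beam_search_cot(toy_expand, beam_width=1, max_steps=3)
wide = beam_search_cot(toy_expand, beam_width=2, max_steps=3)
print("beam_width=1:", greedy[0][1])
print("beam_width=2:", wide[0][1])
```

With beam_width=1 the search reduces to single-path (greedy) reasoning and stays with the initially more likely "roadside debris" reading. With beam_width=2 the "fallen tree" hypothesis survives the first pruning step and ends with the higher cumulative score, which is the behaviour the abstract attributes to keeping several hypotheses alive.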

Key words: large multimodal model, social governance, AI agent
