Journal of East China Normal University (Natural Science), 2025, Vol. 2025, Issue (5): 53-65. doi: 10.3969/j.issn.1000-5641.2025.05.006

• AI-Empowered Open-Source Technologies and Applications •

Research on challenges and optimization of large multimodal model applications in treefall scenarios

Lei FENG1, Chaonan LI1, Chunjie SHENG2,*, Yuxing SHI2, Yicheng HUANG1, Jianhong JIN1, Yun XU1, Yuzhou DU1, Nina ZHOU1, Sihao MIAO1

  1. Hangzhou OpenPie Technology Development Co. Ltd., Hangzhou 310000, China
  2. The Government Affairs Service Management Office of Pinghu City, Pinghu, Zhejiang 314200, China
  • Received: 2025-01-15 Accepted: 2025-08-05 Online: 2025-09-25 Published: 2025-09-25
  • Contact: Chunjie SHENG E-mail: ray.von@openpie.com; 214539069@qq.com
  • About the author: Lei FENG, male, MBA visiting professor; research interest: data computing systems. E-mail: ray.von@openpie.com

Abstract:

Large multimodal models (LMMs) show limited robustness in complex visual scenarios, such as identifying responsibility for fallen trees, because they rely on single-path reasoning. To address this, this study proposes a reasoning optimization method based on Beam Search Chain-of-Thought (BS-CoT). Conventional models often fall into a “first-impression” trap, in which an incorrect initial inference leads to an irreversible analytical failure. BS-CoT counteracts this by exploring and evaluating multiple candidate reasoning paths in parallel: it maintains a diverse set of hypotheses about the scene and continuously prunes the less likely ones, so the model does not commit to a single, fallacious line of reasoning. This strengthens visual decision-making in complex and noisy environments. To validate the method, we constructed a specialized dataset covering a wide range of treefall incidents in urban governance. Experimental results show that, compared with baseline models, the proposed method substantially improves both event recall and the key information capture rate. This research provides a reliable technical solution for visual decision-making problems in urban public safety and offers a more robust paradigm for improving the reasoning reliability of large models in critical applications.
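The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of beam search over reasoning steps, not the authors' method. The function `beam_search_cot`, the `expand` callback, and the toy scene with its probabilities are all hypothetical and invented for illustration; in a real system the candidate steps and their scores would presumably come from the multimodal model.

```python
import math
from typing import Callable, List, Tuple

Path = Tuple[str, ...]          # a reasoning path is a tuple of step descriptions
Scored = Tuple[Path, float]     # path plus its cumulative log-score


def beam_search_cot(
    expand: Callable[[Path], List[Tuple[str, float]]],
    beam_width: int,
    max_steps: int,
) -> List[Scored]:
    """Keep the `beam_width` best-scoring reasoning paths after each step.

    `expand(path)` returns candidate next steps with log-scores;
    an empty list means the path is finished.
    """
    beams: List[Scored] = [((), 0.0)]
    for _ in range(max_steps):
        candidates: List[Scored] = []
        for path, score in beams:
            next_steps = expand(path)
            if not next_steps:
                candidates.append((path, score))  # finished path is carried over
                continue
            for step, step_score in next_steps:
                candidates.append((path + (step,), score + step_score))
        # Prune: keep only the most likely hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Hypothetical toy scene; the probabilities are made up, not data from the paper.
TOY_STEPS = {
    (): [("tree is leaning", math.log(0.6)),
         ("tree has fallen", math.log(0.4))],
    ("tree is leaning",): [("no obstruction, low risk", math.log(0.5)),
                           ("branch may fall on the road", math.log(0.5))],
    ("tree has fallen",): [("trunk blocks the road, report event", math.log(0.9)),
                           ("debris only, no event", math.log(0.1))],
}


def toy_expand(path: Path) -> List[Tuple[str, float]]:
    return TOY_STEPS.get(path, [])


greedy = beam_search_cot(toy_expand, beam_width=1, max_steps=2)[0]
best = beam_search_cot(toy_expand, beam_width=2, max_steps=2)[0]
print("single path:", greedy[0])
print("beam of 2:  ", best[0])
```

In this toy setup, a beam width of 1 commits to the more likely first step ("tree is leaning") and cannot recover, while a beam width of 2 keeps the alternative alive and ends on the "tree has fallen" path, whose overall score is higher. This mirrors the “first-impression” trap described in the abstract.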

Key words: large multimodal model, social governance, AI agent

CLC number: