From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

祝文鑫; 陈安东; 宋宇宸; 陈科海; 朱聪慧; 陈自岩; 赵铁军

doi:10.3969/j.issn.1003-0077.2026.05.001

Abstract

With multimodal large language models (MLLMs) achieving remarkable results on perception tasks, improving their complex reasoning capability has become a key research focus. Current models suffer from opaque reasoning paths and insufficient generalization. Chain-of-Thought (CoT) reasoning improves reasoning transparency and output interpretability and has already proven effective in large language models, so extending this paradigm to the multimodal domain is expected to strengthen the reasoning ability of MLLMs. This paper presents a systematic review of multimodal CoT. It first analyzes the background and theoretical motivations of multimodal CoT from the perspectives of technological evolution and task demands. It then introduces the mainstream multimodal CoT methods from three aspects, namely the CoT paradigm, the post-training phase, and the inference phase, and analyzes their underlying mechanisms. It also summarizes existing benchmarks and evaluation metrics and explores the application scenarios of multimodal CoT. Finally, it discusses the challenges that multimodal CoT currently faces and outlines future research directions.
