Abstract:
With the remarkable progress of multimodal large language models (MLLMs) in perception tasks, enhancing their reasoning capability has become a key research focus. Chain-of-Thought (CoT) reasoning makes the reasoning process more transparent and the output more interpretable, and extending this paradigm to the multimodal domain is expected to strengthen the reasoning ability of MLLMs. This paper presents a systematic review of multimodal CoT. It first analyzes the background and theoretical motivations from the perspectives of technological evolution and task demands. It then introduces the major approaches to enabling CoT reasoning in MLLMs, organized into three aspects: the CoT paradigm, the post-training phase, and the inference phase. It further summarizes existing benchmarks and evaluation metrics and explores potential application scenarios of multimodal CoT. Finally, the paper discusses the current challenges in this field and outlines future research directions.