ZHU Wenxin, CHEN Andong, SONG Yuchen, CHEN Kehai, ZHU Conghui, CHEN Ziyan, ZHAO Tiejun. From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models[J]. Journal of Chinese Information Processing, 2026, 40(5): 1-25. DOI: 10.3969/j.issn.1003-0077.2026.05.001

From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

With the remarkable progress of multimodal large language models (MLLMs) in perception tasks, enhancing their reasoning capability has become a key research focus. Chain-of-Thought (CoT) reasoning improves reasoning transparency and output interpretability, and extending this paradigm to the multimodal domain is expected to enhance the reasoning ability of MLLMs. This paper presents a systematic review of multimodal CoT. It first analyzes the background and theoretical motivations from the perspectives of technological evolution and task demands. It then introduces the major approaches to enabling CoT reasoning in MLLMs, covering the CoT paradigm, the post-training phase, and the inference phase. It further summarizes existing benchmarks and evaluation metrics and explores potential application scenarios of multimodal CoT. Finally, the paper discusses the current challenges in this field and outlines future research directions.