
TL;DR

Proposes DeepPerception, which trains MLLMs to perform cognitive visual reasoning, and introduces the Knowledge-Intensive Visual Grounding task (plus releases KVG-Bench).


Background

MLLM์ด ์•„๋Š” ๊ฑด ๋งŽ์•„๋ณด์—ฌ๋„ Visual Reasoning์€ ์ž˜ ์•ˆ ๋จ.

  • Plain zero-shot CoT prompting does not get them to visual reasoning grounded in knowledge and analysis (= Cognitive Visual Perception)
  • Fine-grained visual perception requires training that incorporates expert knowledge

Problem Statement

Knowledge-Intensive Visual Grounding

  • conventional visual grounding + expert-level knowledge + fine-grained perception

Suggestions

DeepPerception

  • KVG-Bench(๋ฐ์ดํ„ฐ ์ƒ์„ฑ): ๊ธฐ์กด FGVC ๋ฐ์ดํ„ฐ์…‹ ๊ธฐ๋ฐ˜ knowledge-aligned ํ•™์Šต๋ฐ์ดํ„ฐ ๊ตฌ์ถ•
    • ์ƒ˜ํ”Œ๋‹จ์œ„๋กœ ๋ชจ๋ธ์ด ๋ถ„์„ํ• ๋งŒํ•œ ๋ฐ์ดํ„ฐ๋กœ ๋ณต์žกํ•˜๊ฒŒ ๊ตฌ์„ฑ
    • e.g., ํ•œ ๊ฐœ์˜ ์ด๋ฏธ์ง€์— ๋™์ผ ์นดํ…Œ๊ณ ๋ฆฌ object ์—ฌ๋Ÿฌ๊ฐœ (๊ฐ•์•„์ง€ - ๋ถˆ๋…, ๋น„๊ธ€, โ€ฆ.) ๋ชจ๋ธ์ด ์ฐจ์ด๋ฅผ ๋น„๊ตํ•˜๋„๋ก ์œ ๋„
    • 10-domain, 1.3K-sample, 531-image, 882-entity
  • DeepPerception: 2-stage training framework
    • SFT w/CoT reasoning: CoT๋กœ ๋‹จ๊ณ„์ ์ธ ์‚ฌ๊ณ ๋ฅผ ๋ฐฐ์šฐ๋„๋ก ์œ ๋„
    • RL for Perception-Cognition Synergy: ๊ณต๊ฐ„์ •๋ ฌ๋ณด์ƒ(IoU ๊ธฐ๋ฐ˜) ๊ณผ format reward๋ฅผ ์„ค๊ณ„ํ•ด์„œ ์‹œ๊ฐ์  ์ฐจ์ด๋ฅผ ์„ธ๋ฐ€ํ•˜๊ฒŒ ๋ณด๋„๋ก ์œ ๋„ (GRPO)

Results

๊ธฐ์กด ๋ชจ๋ธ๋“ค์€ memorize์— ์˜์กดํ•œ๋‹ค๋ฉด deepperception์€ ์ง€์‹์„ ์ œ๋Œ€๋กœ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์ž„

  • KVG-Bench results: Table 1
    • backbone: 7B models such as InternVL2 / Qwen2-VL
    • comparison: baseline / SFT / SFT+RL
    • DeepPerception improves on the baseline Qwen2-VL-7B by 8.08%
    • Existing models degrade badly out of distribution, but models trained with the proposed recipe also excel on unseen domains
    • It even beat specialist object-detection models such as YOLO-World, G-DINO-1.6-Pro, and DINO-X
  • FGVR results: Table 2
    • datasets: FGVC-Aircraft (aircraft types), Stanford-Cars (car models), etc.
    • baselines: LLaVA-1.5 / Phi-3-Vision / Idefics2 / Finedefics / Qwen2-VL-7B
    • Outperforms the fine-tuned Qwen2-VL-7B by 3.64% on average; the authors argue this is because the model performs cognitive analysis beyond plain image classification
  • MMBench, MMMU results: Table 3
    • datasets: MMBench-V1.1 (test), MMMU (val), AI2D, MathVision
    • baseline: Qwen2-VL-7B
    • Holds at Qwen2-VL-7B level (i.e., general ability does not degrade)
  • Ablation, effect of the two stages: Table 4
    • +CoT-SFT → +2.69%, +GRPO → +5.39%

Personal note. The terminology tripped me up a bit on my first read:

  • Visual Reasoning: look at an image and answer
  • Cognitive Visual Perception: additionally surface the analysis and the reasoning process behind the conclusion

์ž‘๋…„ ํ•˜๋ฐ˜๊ธฐ์— ๋ณด๋˜ ๊ทธ visual commonsense์ชฝ ์ด์•ผ๊ธฐ๋ฅผ ํ•ด์ฃผ๋Š”์ค„ ์•Œ๊ณ  ๋ดค๋Š”๋ฐ ๊ทธ๋Ÿฐ ๋‚ด์šฉ์€ ์•„๋‹ˆ์—ˆ์–ด์š”. ๋‹ค๋งŒ ๋งŽ์ด ์ฃผ๋ชฉ๋ฐ›์€ ๋…ผ๋ฌธ์ธ ๊ฒƒ ๊ฐ™์•„์„œ ๋“œ๋žํ•˜์ง€ ์•Š๊ณ  ์‚ดํŽด๋ดค๋Š”๋ฐ, NLP์—์„œ๋Š” CoT-SFT ํ•˜๊ณ  RL ๋ถ™์—ฌ์ฃผ๋Š”๊ฒŒ ํ†ต์ƒ์ ์ธ ํ๋ฆ„์ธ๊ฑฐ๊ฐ™์€๋ฐ multi-modal reasoning์—์„œ๋Š” ์‹œ๋„๋˜์ง€ ์•Š์•˜๋‚˜๋ณด๋„ค์š”. ์„ฑ๋Šฅ์ด ๋งŽ์ด ์ข‹์•„์ง€๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์—ฌ์š”.