Meta info.
  • Authors: Rhea Chowers, Udi Barzelay, Oshri Naparstek, Yair Weiss
  • Paper: https://arxiv.org/abs/2603.29080
  • Affiliation: Hebrew University of Jerusalem, IBM Research
  • Published: March 30, 2026

TL; DR

The paper proves theoretically that the modality gap in CLIP-family multi-modal contrastive models is a bug that degrades robustness, and proposes a post-processing algorithm that improves robustness without losing clean accuracy.

Figures.
  • Figure 1: distribution of CLIP embeddings on MS-COCO, and an example of misclassification caused by caption rephrasing
  • Figure 2: change in downstream performance as the gap-adjustment parameter alpha varies
  • Figure 3: near-zero-loss solutions
  • Figure 4: how the gap forms during the training dynamics
  • Figure 5: distribution of S_i^y and the compression of variance along the gap direction
  • Figure 7: illustration of the relationship between modality gap and robustness
  • Figure 8: accuracy vs robustness as the gap is adjusted under Gaussian noise
  • Figure 9: quantization robustness
  • Figure 10: change in accuracy under text rephrasing
  • Figure 14: robustness improvement under various noise distributions

Background

  • Multi-modal contrastive models are trained to align image-text pairs in a shared embedding space, yet a modality gap still emerges
    • contrastive loss: pulls the embeddings of a matching pair together and pushes non-matching pairs apart
      • representative models: CLIP, SigLIP, MetaCLIP, etc.
    • the image-embedding and text-embedding distributions sit clearly separated on the unit hypersphere (Fig 1, left)
      • this directly contradicts the training objective (making the two modalities overlap)
  • Prior work has offered explanations, but counterexamples show that none of them is sufficient
    • information imbalance: one image can correspond to many captions, so text is the more abstract modality
    • dimensionality collapse: during contrastive training, the variance of each modality concentrates in only a few dimensions
    • the gap's effect on downstream performance is also unclear: in some cases widening the gap helps, in others narrowing it helps (Fig 2)
  • (Separately) the weak robustness of CLIP-family models is also a well-known problem
    • predictions change even under semantically equivalent changes such as a single-pixel shift or caption rephrasing (Fig 1, right)
    • there is no theoretical account linking this fragility to the modality gap
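To make the "pull matching pairs together, push the rest apart" description concrete, here is a minimal NumPy sketch of a CLIP-style symmetric softmax contrastive loss. It is an illustration under assumptions, not the training code of any of the models above: the function names, the toy data, and the temperature value 0.07 are chosen here for the example, and SigLIP in particular uses a sigmoid-based variant rather than this softmax form.

```python
import numpy as np

def clip_style_loss(img, txt, temperature=0.07):
    """Symmetric softmax contrastive loss for paired, L2-normalized embeddings."""
    logits = img @ txt.T / temperature            # (n, n) pairwise similarities
    idx = np.arange(len(img))                     # matching pairs sit on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)      # stabilise the softmax
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[idx, idx].mean()

    # average of the image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def normalize(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Toy data (assumption): random unit vectors standing in for embeddings.
rng = np.random.default_rng(0)
img = normalize(rng.normal(size=(8, 16)))
aligned = clip_style_loss(img, img)                        # every pair coincides
shuffled = clip_style_loss(img, np.roll(img, 1, axis=0))   # pairs are mismatched
```

In this toy setup the loss is small when each text embedding coincides with its image and large when the pairing is shuffled, which is the sense in which the objective asks the two modalities to overlap.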

Problem Statement

Using robustness as the criterion, the paper tackles the two questions below.

  • Why does the modality gap arise?
    • minimizing the contrastive loss should make the two modalities overlap, but in practice gradient descent converges with the gap still in place
    • the existing explanation, that dimensionality collapse is the cause, is not a necessary and sufficient condition
  • Is the gap a feature or a bug?
    • as in Fig 2, adjusting the gap size changes downstream performance inconsistently from model to model
    • there is no unified framework that says from which perspective the gap should be evaluated

Suggestions

  • The larger the modality gap, the more easily the nearest neighbor changes under an embedding perturbation → lower robustness → the gap is a bug
  • The paper proves this theoretically and proposes a post-processing algorithm that improves robustness without retraining

Key concepts

  • Local gap: the vector between a single image-text pair
  • Global gap \vec{g}: the vector between the two modality means; a single representative value for how far apart the two modalities are
  • Global Orthogonality Assumption: once training has converged, \vec{g} is orthogonal to both modality distributions
  • (Rebuttal of prior work) Dimensionality collapse is not a necessary and sufficient condition
    • prior work: dimensionality collapse, where the within-modality variance concentrates in only a few dimensions, causes the gap
    • rebuttal: the real cause = initial cluster separation combined with the contrastive loss dynamics
      • the gap also appears with an isotropic Gaussian initialization, so collapse is not necessary (Fig 4)
      • with a fully dimensionality-collapsed initialization, training instead converges without a gap, so collapse is not sufficient (Fig 11)
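The local gap, the global gap, and the orthogonality assumption can be checked numerically on any pair of embedding matrices. The sketch below does so on synthetic data; the helper names `global_gap` and `variance_along`, the sign convention (mean of x minus mean of y), and the toy geometry are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def global_gap(x, y):
    """Vector between the two modality means (here: mean of x minus mean of y)."""
    return x.mean(axis=0) - y.mean(axis=0)

def variance_along(points, direction):
    """Variance of `points` projected onto the unit vector along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return float(np.var(points @ unit))

# Toy data (an assumption, not CLIP output): both modalities vary only in the
# first 4 coordinates and are separated by a constant offset in the 5th.
rng = np.random.default_rng(0)
n = 200
jitter = rng.normal(scale=0.1, size=(n, 4))
jitter -= jitter.mean(axis=0)                 # keep the in-plane means identical
x = np.zeros((n, 5))
y = np.zeros((n, 5))
x[:, :4] = rng.normal(size=(n, 4))
y[:, :4] = x[:, :4] + jitter
x[:, 4] = 1.0
y[:, 4] = -1.0

local_gaps = x - y                            # one vector per image-text pair
g = global_gap(x, y)                          # approximately (0, 0, 0, 0, 2)

var_gap_x = variance_along(x, g)              # ~0: g is orthogonal to the spread of x
var_gap_y = variance_along(y, g)              # ~0: and to the spread of y
var_in_plane = variance_along(x, np.eye(5)[0])
```

With one-to-one pairs the global gap is simply the mean of the local gaps. "Orthogonal to both modality distributions" shows up here as near-zero variance of each modality along \vec{g}, while the variance in the other directions stays large.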

Theorem 3.1: Why does the gap arise? Variance shrinks first along the gap direction

  • The gradient of the contrastive loss consists of two forces
    • an attractive force that pulls each y_i toward its matching x_i
    • a repulsive force that pushes it away from the other points
  • For points that are close along the gap direction, the repulsive force becomes stronger than the attractive force, so they are pushed away
  • As a result, the variance along the gap direction shrinks first, early in training (Fig 4, Fig 5)
    • the gradient acts to reduce the within-modality variance along the gap vector direction
    • that is, along the gap direction each y_i moves toward \mu_y (variance compression)
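The split into an attractive and a repulsive force can be read off the gradient directly. The derivation below is a sketch under simplifying assumptions that are not taken from the paper: only the image-to-text direction of the softmax loss is used, the embeddings are treated as unnormalized free variables, and the symbols \tau (temperature) and p_{ik} (softmax weight) are notation introduced here.

```latex
\mathcal{L} = \sum_i \Big[ -\tfrac{1}{\tau}\, x_i^\top y_i
              + \log \sum_j \exp\!\big( x_i^\top y_j / \tau \big) \Big],
\qquad
p_{ik} = \frac{\exp\!\big( x_i^\top y_k / \tau \big)}
              {\sum_j \exp\!\big( x_i^\top y_j / \tau \big)}

-\frac{\partial \mathcal{L}}{\partial y_k}
  = \frac{1}{\tau} \Big( \underbrace{x_k}_{\text{attractive}}
    \; - \; \underbrace{\sum_i p_{ik}\, x_i}_{\text{repulsive}} \Big)
```

A gradient-descent step therefore moves y_k toward its own x_k and away from a softmax-weighted average of all x_i; which term dominates depends on how the weights p_{ik} are distributed.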

Theorem 3.2: Why does the gap converge to be orthogonal to both modalities?

  • Assumption: at some point the variance of both modalities becomes 0 along some direction, and afterwards (late in training) the softmax assignment matrix becomes approximately doubly stochastic (row sums = column sums = 1); then
    • the gradient along the gap direction becomes 0 and movement in that direction stops
    • alignment continues only in the remaining directions
  • Result: \vec{g} converges to a state orthogonal to both modalities (Fig 4, Fig 6)
    • Fig 4: visualization of the dynamics in a toy setting: initial tight clusters → variance compression along the gap direction → alignment converges only in the directions orthogonal to the gap
    • Fig 6: in actual CLIP training as well, S_i^x, S_i^y → 1 and the initial cluster separation condition are confirmed
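The role of the doubly stochastic condition can be illustrated numerically. The sketch below is a toy check under assumptions, not the paper's proof: it uses only the image-to-text softmax loss with unnormalized embeddings, a hand-picked gap direction `g_hat`, and synthetic points with zero variance along that direction. It shows that the gradient component along the gap is proportional to one minus the column sum of the softmax matrix, so it vanishes exactly when the column sums reach 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, tau = 6, 4, 0.5
g_hat = np.array([0.0, 0.0, 0.0, 1.0])       # assumed unit gap direction

# Both modalities have zero variance along g_hat: x sits at +1, y at -1.
x = rng.normal(size=(n, d))
x[:, -1] = 1.0
y = rng.normal(size=(n, d))
y[:, -1] = -1.0

logits = x @ y.T / tau
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)            # rows sum to 1 by construction
col_sums = p.sum(axis=0)                     # need not equal 1 in general

# Negative gradient of the image->text softmax loss w.r.t. each y_k:
# (x_k - sum_i p[i, k] * x_i) / tau
neg_grad = (x - p.T @ x) / tau
along_gap = neg_grad @ g_hat                 # component along the gap direction
```

Because every x_i has the same coordinate along `g_hat`, `along_gap[k]` equals `(1 - col_sums[k]) / tau`. Rows of a softmax matrix always sum to 1; it is the column sums approaching 1 that switches off movement along the gap in this toy setting.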

Theorem 3.4: A larger gap lowers robustness

  • Definition of robustness: the probability that the nearest neighbor does not change when noise is added to the embedding
  • Under the orthogonality assumption, moving y toward X along the global gap vector \vec{g} increases robustness
  • Intuition from Fig 7: the farther the image is from the text (= the larger the gap)
    • the more easily the classification flips under a small rotation of the decision boundary (e.g. small noise on the text embedding)
    • when the image sits close to the text, the prediction is far more stable under the same noise
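The robustness definition above lends itself to a simple Monte Carlo estimate. The following sketch is an assumption-laden toy, not the paper's evaluation code: it uses Euclidean nearest neighbors, adds isotropic Gaussian noise to the text side only, and places synthetic "images" directly above their "texts" at two different gap sizes. The function names and all numeric settings are chosen here for illustration.

```python
import numpy as np

def nearest(queries, keys):
    """Index of the nearest key (Euclidean distance) for each query."""
    d2 = ((queries[:, None, :] - keys[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def robustness(queries, keys, sigma, trials=200, seed=0):
    """Estimated probability that a query keeps its nearest key under key noise."""
    rng = np.random.default_rng(seed)
    clean = nearest(queries, keys)
    kept = 0.0
    for _ in range(trials):
        noisy = keys + sigma * rng.normal(size=keys.shape)
        kept += (nearest(queries, noisy) == clean).mean()
    return kept / trials

# Toy data: texts live in the first 4 coordinates; images are copies of the
# texts displaced along the 5th coordinate, which plays the role of the gap.
rng = np.random.default_rng(1)
texts = np.zeros((20, 5))
texts[:, :4] = rng.normal(size=(20, 4))
images_far = texts.copy()
images_far[:, 4] = 5.0                       # large gap
images_near = texts.copy()
images_near[:, 4] = 0.5                      # gap mostly closed

r_far = robustness(images_far, texts, sigma=0.2)
r_near = robustness(images_near, texts, sigma=0.2)
```

In this toy the clean nearest neighbors are identical for both gap sizes, but the noise component along the gap is multiplied by the gap length when distances are compared, so the large-gap configuration loses its nearest neighbors more often.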

Theorem 3.5: Clean accuracy is preserved even when the gap is reduced

  • If \vec{v} is orthogonal to the affine subspace of a modality, shifting the modality along that direction preserves the relative ordering of distances between all points (the cross-modal nearest-neighbor structure is preserved)
  • Theorem 3.4 + 3.5 combined: reducing the gap raises robustness while clean accuracy stays the same
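A short calculation shows why such a shift cannot change any nearest neighbor. This is a sketch for squared Euclidean distance only; the symbols t (shift size) and c (the common value of \vec{v}^\top y_j) are introduced here and are not the paper's notation. Orthogonality of \vec{v} to the affine subspace containing all y_j means \vec{v}^\top (y_j - y_k) = 0 for every j, k, hence \vec{v}^\top y_j = c for all j.

```latex
\big\| x - (y_j + t\,\vec{v}) \big\|^2
  = \| x - y_j \|^2 \;-\; 2t\, \vec{v}^\top (x - y_j) \;+\; t^2 \|\vec{v}\|^2
  = \| x - y_j \|^2 \;+\; \underbrace{\big( -2t\, \vec{v}^\top x + 2tc + t^2 \|\vec{v}\|^2 \big)}_{\text{independent of } j}
```

Every candidate y_j has its squared distance to x changed by the same amount, so the ranking of the y_j, and in particular the nearest neighbor of x, is unchanged.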

Post-processing Gap Closure

  • Reduce the gap, but select only the component orthogonal to the modality and move only along that direction
    • (1) use PCA to compute the principal directions V of the retrieval-target modality
    • (2) project \vec{g} onto the orthogonal complement of V, i.e. remove from \vec{g} its components along those directions
    • (3) shift the modality along the resulting \vec{g}
  • No retraining is needed; the procedure can be applied post hoc to the embedding space before inference, and applies generally to any cross-modal nearest-neighbor task
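The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `close_gap`, the parameters `n_components` and `alpha`, the choice to shift the query side, and the synthetic data are all hypothetical, and PCA is done with a plain SVD of the centered target embeddings.

```python
import numpy as np

def nearest(queries, keys):
    """Index of the nearest key (Euclidean distance) for each query."""
    d2 = ((queries[:, None, :] - keys[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def close_gap(queries, targets, n_components, alpha=1.0):
    """Shift queries along the part of the global gap orthogonal to the targets' principal directions."""
    g = targets.mean(axis=0) - queries.mean(axis=0)       # global gap
    centered = targets - targets.mean(axis=0)
    # (1) PCA via SVD: the rows of vt are the principal directions of the targets
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    V = vt[:n_components]
    # (2) remove from g its components along V (orthogonal-complement projection)
    g_perp = g - V.T @ (V @ g)
    # (3) shift the queries along the projected gap
    return queries + alpha * g_perp

# Toy data (assumption): texts span 3 of 5 dimensions, images are offset from them.
rng = np.random.default_rng(0)
texts = np.zeros((30, 5))
texts[:, :3] = rng.normal(size=(30, 3))
images = texts + 0.3 * rng.normal(size=(30, 5)) + np.array([0.0, 0.0, 0.0, 4.0, 0.0])

shifted = close_gap(images, texts, n_components=3)
gap_before = np.linalg.norm(texts.mean(axis=0) - images.mean(axis=0))
gap_after = np.linalg.norm(texts.mean(axis=0) - shifted.mean(axis=0))
same_neighbors = np.array_equal(nearest(images, texts), nearest(shifted, texts))
```

On this toy data the gap norm drops sharply while every image keeps the same nearest text, which is the behavior Theorem 3.5 describes. With real embeddings the targets do not lie exactly in a low-dimensional subspace, so the number of retained components becomes a tuning choice.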

Effects

  • Experimental setup
    • models: CLIP (ViT-B/16, ViT-L/14), SigLIP, MetaCLIP
    • datasets: ImageNet, CIFAR10/100, MS-COCO, A-OKVQA (multiple choice VQA)
    • embedding generation: the openclip library
    • noise types: Gaussian noise ($\eta \sim \mathcal{N}(0, \sigma^2 I)$), quantization noise, text rephrasing
    • evaluation: zero-shot classification accuracy, R@1 (retrieval), VQA accuracy, robustness (nearest-neighbor retention rate)
  • Results
    • Fig 8 Controlled Gaussian noise: robustness increases monotonically as the gap is reduced, while accuracy barely changes
      • a consistent pattern across several models (CLIP, SigLIP, etc.) and several tasks (CIFAR10/100, A-OKVQA, etc.)
      • matches the prediction of Theorem 3.5
    • Fig 9 Quantization noise: reducing the gap before quantizing improves robustness substantially
      • practically relevant in settings that store embeddings ahead of time, such as RAG
      • quantization noise may not be zero-mean, so minimizing the gap in the quantized space is more effective
    • Fig 10 Text rephrasing: a setting that does not satisfy the theoretical assumptions (zero-mean, uncorrelated noise), and yet
      • reducing the gap significantly improves A-OKVQA accuracy under rephrasing while clean accuracy is maintained
    • Fig 14 Various noise distributions: robustness also improves consistently under Student's-t, Uniform, and Laplacian noise
      • Theorem 3.4 holds regardless of the distribution type as long as the noise is zero-mean and uncorrelated across dimensions
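The remark that quantization noise may not be zero-mean is easy to see on a small example. The sketch below assumes a plain uniform scalar quantizer and synthetic coordinates clustered between two grid points; the quantizer, the step size, and the data are illustrative choices and are not the quantization scheme used in the paper.

```python
import numpy as np

def quantize(v, step):
    """Uniform scalar quantizer: snap each coordinate to the nearest multiple of `step`."""
    return np.round(v / step) * step

# Toy embeddings (assumption): coordinates clustered near 0.3, grid step 0.5.
rng = np.random.default_rng(0)
emb = 0.3 + 0.02 * rng.normal(size=(1000, 8))
err = quantize(emb, step=0.5) - emb          # the quantization "noise"
mean_err = err.mean()                        # clearly positive, not zero
```

Almost every coordinate rounds up to 0.5, so the error has a systematic positive mean instead of averaging out. Noise of this kind moves a whole modality in one direction, which is why it falls outside the zero-mean assumption of Theorem 3.4.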

Personal note. This is a recent paper I looked up out of curiosity after asking, at a kakao online meetup, why the latest omni models do not include images in their output. It is striking that the modality gap was observed for so long without a theoretical account of the "why" and the "what". The paper brings in robustness as a comparatively clear lens (?), defines the gap as a bug, and goes on to explain the cause through training dynamics, and that flow reads cleanly. In particular, the result that the nearest-neighbor structure is preserved even when the gap is reduced (Theorem 3.5) is the key point; without it the algorithm would not be justified. The conclusion that post-processing alone can improve robustness is attractive in practice. Input-space noise such as rephrasing breaks the theoretical assumptions, yet the method still works well empirically; that looks a little like a flaw, but my impression is that it is, if anything, another practical point in its favor. What interests me is that this problem is not confined to CLIP: in theory it applies to every multi-modal model trained with a contrastive loss, and it is in fact confirmed in the same way for the SigLIP and MetaCLIP families. Generative families such as LLaVA are outside the scope of the theory, but many of them use a CLIP encoder as-is internally, so I think the gap may already exist at the encoder level. For any system that uses RAG or embedding-based retrieval, this post-processing seems worth considering regardless of model family. As a separate matter, I feel it would be worth running at least a check for similar phenomena with the Kanana omni model.