
Meta info.
  • Authors: Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason Weston, Tianlu Wang
  • Paper: https://arxiv.org/pdf/2501.18099
  • Affiliation: Meta AI
  • Published: January 30, 2025

TL;DR

์‚ฌ์ „์— ํ‰๊ฐ€ ๊ธฐ์ค€์„ ์ œ๊ณตํ•˜์ง€ ์•Š๊ณ , ์ž์ฒด์ ์œผ๋กœ ํ‰๊ฐ€ ๊ณ„ํš-์‹คํ–‰-ํŒ๋‹จ์„ ๋ถ„๋ฆฌํ•˜์—ฌ ์ˆ˜ํ–‰ํ•˜๋Š” Self-training loop์˜ thinking-llm-as-a-judge framework ์ œ์•ˆ, ์ ์€ ๋ฐ์ดํ„ฐ๋กœ๋„ SOTA ์„ฑ๋Šฅ๋‹ฌ์„ฑ

Background

Research on replacing human evaluation with LLM-as-a-Judge-style machine evaluation is thriving.

  • ๊ธฐ์กด ์—ฐ๊ตฌ๋Š” ์‚ฌ์ „ ์ •์˜๋œ ํ‰๊ฐ€ ๊ธฐ์ค€(criteria), ์ฐธ์กฐ ์ •๋‹ต(reference answers), ๊ฒ€์ฆ ์งˆ๋ฌธ(verification questions) ๋“ฑ ํ•„์š”

Problem Statement

  • ๋„๋ฉ”์ธ/๋ชฉ์  ๋ณ„ ์ง์ ‘ ํ‰๊ฐ€ ๊ด€๋ จ ์š”์†Œ๋ฅผ ์ง์ ‘ ์กฐ์ •ํ•ด์•ผํ•˜๊ณ , ์ผ๋ฐ˜ํ™” ๋ถˆ๊ฐ€๋Šฅ
  • ํ‰๊ฐ€์—์„œ Planning๊ณผ reasoning์ด ํ˜ผ์žฌ๋จ. (์ฒด๊ณ„์„ฑ ๋ถ€์กฑ)

Proposed Method

EvalPlanner (Thinking-LLM-as-a-Judge)

  • ์ฃผ์š” ๊ฐœ๋… ๋ฐ ์ ˆ์ฐจ: evaluation plan (z) generation > plan execution (e) > final verdict (y) (์ˆœ์„œ์ƒ ์„ธ๋ฒˆ์งธ pic, method overview)
    • p(z x): ์ž…๋ ฅ x์— ๋Œ€ํ•œ ํ‰๊ฐ€ ๊ณ„ํš z ์ƒ์„ฑํ•˜๊ณ 
    • p(e z, x, a, b): ๊ณ„ํš z์— ๋”ฐ๋ผ ์‘๋‹ต a์™€ b ํ‰๊ฐ€
    • p(y e, z, x, a, b): ํ‰๊ฐ€ ์‹คํ–‰ํ•œ ๊ฒฐ๊ณผ ๋ฐ”ํƒ•์œผ๋กœ ์ตœ์ข… ํŒ๋‹จ y ์ƒ์„ฑ
  • DPO training procedure (Table 2 in the paper; a sketch of the preference-pair construction follows the pipeline sketch below)
    • Initial policy SFT: train the model to follow the z, e, y structure, using only examples that reach the correct verdict
    • 1st DPO iteration: contrastive training on pairs of correct vs. incorrect judgments
    • 2nd DPO iteration: sample new z, e from the 1st-iteration model and run the optimization loop again
  • ์ฃผ์š” ํŠน์ง•
    • planning๊ณผ execution ๋ถ„๋ฆฌ: ๋ชจ๋ธ์ด ํ‰๊ฐ€ ๊ณ„ํš ์ƒ์„ฑ ํ›„ ๊ณ„ํš ์‹คํ–‰ํ•˜์—ฌ ํ‰๊ฐ€ ์ˆ˜ํ–‰โ†’ ํ‰๊ฐ€์— ์‹ ๋ขฐ์„ฑ/ํˆฌ๋ช…์„ฑ/์ง๊ด€์„ฑ ์ฆ๊ฐ€
    • self-training loop ํ™œ์šฉ: ๋ชจ๋ธ ์Šค์Šค๋กœ ๋ฐ˜๋ณต์ ์œผ๋กœ/์ž์ฒด์ ์œผ๋กœ ์ƒ์„ฑํ•œ ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜ ์ ์ง„์  ์„ฑ๋Šฅ ํ–ฅ์ƒ (Self-Optimization)
    • ์‚ฌ์ „ ์ •์˜๋œ ๊ธฐ์ค€ ์—†์ด(๋…ผ๋ฌธ์—์„œ unconstrained๋กœ ํ‘œํ˜„) planning ๊ฐ€๋Šฅ: ๋‹ค์–‘ํ•œ Task/๋„๋ฉ”์ธ ์ ์šฉ ๊ฐ€๋Šฅ
    • ๋ฐ์ดํ„ฐ ํšจ์œจ์„ฑ ์ฆ๊ฐ€: ๋” ์ ์€ (22k ๊ทœ๋ชจ ์ƒ์„ฑ ๋ฐ์ดํ„ฐ๋กœ preference pairs๋กœ DPO ํ•™์Šต) ๋†’์€ ์„ฑ๋Šฅ ๋‹ฌ์„ฑ

Results

  • Benchmarks: RewardBench (reward-model evaluation), FollowBenchEval (multi-level constraint following), RM-Bench (robustness of reward models), JudgeBench (LLM-as-a-judge ability across diverse domains)
  • Results:
    • RewardBench: SOTA, achieved with less training data than prior work
    • FollowBenchEval: 13% improvement over the previous SOTA
    • RM-Bench: 8% improvement over the previous SOTA
    • JudgeBench: strong on reasoning categories as well

Personal note. It's a simple flow, but it matches the part of our current project that a teammate is handling, and it could serve as a reference for the review step when using Langchain. Of course, we won't go as far as DPO… in any case, let's look into it together.