Meta info.

TL;DR

Self-reflective(?) rationales and multiple reasoning chains reduce confidence calibration error in LLMs by 30%.


Problem Statement

In the context of LLM hallucination, generated outputs come with no confidence estimate, and existing confidence-estimation approaches that modify the prompt or rely on training look indirect or second-best. (pic1)

Suggestions

SaySelf: the LLM provides a confidence estimate and directly generates its self-reflective reasoning (rationale). (pic2)

  • (stage 1) finetuning: construct a supervised dataset (each item: question q, answer a including its reasoning chain, self-reflective(?) rationale, and a confidence estimate on a 10-point scale)
    • Prompt the model multiple times on the 90K questions of HotpotQA → cluster the generations (with a predefined size s) → select one response from a cluster
    • Confidence estimate: compare the selected response with the gold answer to determine c (defined heuristically as c = round(s/N*10))
    • Self-reflective rationale: GPT-4 is asked to analyze and summarize the inconsistencies among the responses, and the result is written up from a first-person perspective
  • (stage 2) reinforcement learning (from stage 1 task supervision): with finetuning alone, the model sometimes shows low confidence on correct answers and high confidence on wrong answers, so this stage attempts to calibrate the confidence estimates
    • What the model is prompted to generate: a, a self-reflective rationale, and a confidence level c
    • Uses PPO with a custom reward function (pic3)
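
The stage 1 confidence label, c = round(s/N*10), can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes N is the number of sampled responses and s is the size of the cluster that agrees with the gold answer, and it replaces semantic clustering with a simple normalized exact match. The helpers `normalize` and `confidence_label` are hypothetical names introduced only for this example.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Hypothetical normalizer: lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def confidence_label(sampled_answers, gold_answer):
    """Return c = round(s / N * 10) for the cluster matching the gold answer.

    Assumptions of this sketch (not stated in the summary above):
    - N is the number of sampled answers,
    - s is the size of the cluster that matches the gold answer,
    - clustering is approximated by normalized exact string match,
    - None is returned when no sampled answer matches the gold answer.
    """
    n = len(sampled_answers)
    if n == 0:
        raise ValueError("need at least one sampled answer")
    # Group identical (normalized) answers; each key stands for one cluster.
    clusters = Counter(normalize(a) for a in sampled_answers)
    s = clusters.get(normalize(gold_answer), 0)
    if s == 0:
        return None
    # 10 * s / n equals s / n * 10; multiplying first keeps the numerator exact.
    return round(10 * s / n)


samples = ["Paris", "paris", "Lyon", "Paris ", "Marseille"]
label = confidence_label(samples, "Paris")  # 3 of 5 samples agree, so the label is 6
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, so a ratio such as 2.5 maps to 2; whether the original heuristic rounds halves up is not specified here.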

Effects

Confidence calibration error drops by 30%, while task performance is maintained on both in-distribution and out-of-distribution data. The generated rationales also improve calibration accuracy.
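
To make "calibration error" concrete, here is a small sketch of expected calibration error (ECE), a common way to measure it. That the reported 30% reduction refers to ECE is an assumption of this example, as are the function name `expected_calibration_error`, the choice of 10 equal-width bins, and the mapping of the 10-point confidence c to c/10.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |average confidence - accuracy| per bin.

    confidences: floats in [0, 1] (e.g. a 10-point confidence c mapped to c / 10)
    correct:     booleans, True where the answer matched the gold answer
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to an equal-width bin; confidence 1.0 goes to the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Confidence 10/10 on two correct answers, 5/10 on one correct and one wrong answer.
scores = [c / 10 for c in (10, 10, 5, 5)]
ece = expected_calibration_error(scores, [True, True, True, False])
```

A model whose stated confidence matches its accuracy in every bin scores 0, and a model that is always fully confident but right only half the time scores 0.5.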