
Meta info.

TL;DR

Proposes FUNCODER, a codegen framework that combines a Divide-and-Conquer strategy with functional consensus.

Background

The arrival of LLMs has clearly pushed codegen performance forward, but programming tasks with complex requirements remain a challenge.

Problem Statement

Splitting codegen into steps or applying a pipeline approach brings several problems: complexity grows (an error in an early stage propagates to later stages), dependence on the LLM is high (usually on model size), and self-testing with LLM-generated test cases is unreliable.

  • research question: how to do codegen more efficiently (especially with smaller models) in a way that copes with high complexity and makes self-testing more reliable

Suggestions

FUNCODER

  • Divide-and-Conquer: expected to reduce complexity
    • Divide: recursively decompose the complex problem (f_root) into sub-functions (f_i), building a hierarchical dependency tree (T)
    • Conquer: solve each function independently, then compose the results into the final program. T is traversed bottom-up from the leaves, repeating the composition step to implement f*_cur
  • functional consensus: expected to prevent error propagation
    • Sample several candidate functions (n generations) > measure similarity of program behavior (compare outputs over a variety of inputs) > select the consensus function based on similarity (the candidate whose behavior is most consistent with the others)
    • Improves the reliability of the generated code
      • With self-testing, the reliability of the test cases themselves cannot be guaranteed. This is sidestepped by not writing explicit test cases: several functions are generated and their outputs are compared directly
    • The paper concludes that the approach improves codegen performance regardless of LLM size
    • Token usage is reported to be similar to existing methods.
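The two ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `bottom_up_order`, `select_by_consensus`, and the three toy candidates are hypothetical, the candidates are hand-written stand-ins for functions an LLM would sample, and the scoring rule (count the inputs on which two candidates return the same non-error output) is one simple way to measure behavioral similarity.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Candidate = Callable[..., Any]


def bottom_up_order(tree: Dict[str, List[str]], root: str) -> List[str]:
    """Post-order walk of a dependency tree: children before their parent."""
    order: List[str] = []
    seen = set()

    def visit(name: str) -> None:
        if name in seen:  # a shared sub-function is only solved once
            return
        seen.add(name)
        for child in tree.get(name, []):
            visit(child)
        order.append(name)

    visit(root)
    return order


def run_safely(func: Candidate, args: Tuple[Any, ...]) -> Tuple[str, Any]:
    """Run one candidate on one input, turning exceptions into a marker."""
    try:
        return ("ok", func(*args))
    except Exception as exc:  # a crashing candidate must not abort the vote
        return ("error", type(exc).__name__)


def select_by_consensus(
    candidates: Sequence[Candidate], inputs: Sequence[Tuple[Any, ...]]
) -> Tuple[Candidate, List[int]]:
    """Pick the candidate whose outputs agree most with the other candidates."""
    outputs = [[run_safely(f, args) for args in inputs] for f in candidates]

    def agreement(i: int, j: int) -> int:
        # Assumption of this sketch: matching errors do not count as agreement.
        return sum(a == b and a[0] == "ok" for a, b in zip(outputs[i], outputs[j]))

    n = len(candidates)
    scores = [sum(agreement(i, j) for j in range(n) if j != i) for i in range(n)]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores


# Toy candidates for "return the sorted unique values of a list".
def cand_a(xs):
    return sorted(set(xs))


def cand_b(xs):  # buggy: forgets to remove duplicates
    return sorted(xs)


def cand_c(xs):
    out = []
    for x in sorted(xs):
        if not out or out[-1] != x:
            out.append(x)
    return out


tree = {"f_root": ["parse", "dedupe_sorted"], "parse": [], "dedupe_sorted": []}
order = bottom_up_order(tree, "f_root")

inputs = [([3, 1, 3],), ([2, 2],), ([],), ([5, 4],)]
chosen, scores = select_by_consensus([cand_a, cand_b, cand_c], inputs)
```

In this toy run the leaves come before `f_root` in `order`, and the buggy `cand_b` gets the lowest score because it disagrees with the other two on every input containing duplicates. No test case with a known expected answer is needed, which is the point of comparing candidates against each other instead of against LLM-written tests.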

Effects

In the paper's experiments, FUNCODER substantially improves both the efficiency and the quality of LLM-based code generation.

  • backbone: GPT-3.5-turbo, GPT-4, and others
  • Code generation (HumanEval, MBPP, xCodeEval) Result:
    • Average improvement over prior methods: HumanEval +9.8%, MBPP +3.3%, xCodeEval +10.4%
    • funcoder-StableCode3b: on HumanEval, reaches about 97.7% of GPT-4's performance (about 118.6% relative to vanilla GPT-3.5)
    • In the accuracy evaluation, functional consensus outperforms self-testing (improved reliability)
  • Mathematical reasoning (MATH) Result:
    • funcoder-GPT-4: +(6.0, 8.3%) over the SOTA Cumulative Reasoning, +(10.0, 14.7%) over PoT (vanilla program-aided baseline)
    • funcoder-GPT-3.5-turbo: +(6.2, 11.1%) over the SOTA Cumulative Reasoning, +(13.0, 31.7%) over PoT
    • The gain seems to come from Divide-and-Conquer producing domain-specific decompositions that reflect the particular knowledge each MATH topic requires

Personal note. The results are good, of course, and the gains on small models are impressive. Perhaps because the figures are so clear, the paper itself also reads very cleanly.