Meta info.
  • Authors: Zeyu Qin, Qingxiu Dong, Xingxing Zhang, Li Dong, Xiaolong Huang, Ziyi Yang, Mahmoud Khademi, Dongdong Zhang, Hany Hassan Awadalla, Yi R. Fung, Weizhu Chen, Minhao Cheng, Furu Wei
  • Paper: https://arxiv.org/pdf/2503.19551
  • Affiliation: HKUST, Microsoft, Peking Univ, Pennsylvania State Univ.
  • Published: March 26, 2025

TL;DR

The paper claims that synthetic data generated with SYNTHLLM scales predictably and effectively for LLM fine-tuning, and that, following a rectified scaling law, it offers a scalable answer to the shortage of natural data.

Background

Since the start of the LLM era, generating and using synthetic data has become common practice, but it has not been established whether "more is better" holds for it the way it does for natural data.

  • The work builds on a recent study (Lin et al., 2024) that applied scaling laws to fine-tuning
  • Earlier scaling law studies showed that performance is predictable from data and model size, but only for natural data (= organic data)
  • Synthetic data is usually generated from human-annotated seed data, so the common view is that its scale and diversity are limited
  • The depletion of high-quality pretraining data is also a major recent concern <- the paper examines whether synthetic data can address it

Problem Statement

  • RQ1 Can synthetic data follow the same scaling law?
  • RQ2 Can even the seeds needed for synthetic data generation be obtained at web scale, without human involvement?

Suggestions

  • RQ1 Scaling Law of Synthetic Data: in an SFT setup, is there a predictable relationship between the amount of synthetic data and model performance? (Does model performance improve as the synthetic dataset grows?) > YES
    • Synthetic data follows the Rectified Scaling Law of Lin et al., 2024 (pic2)
      • That is, model performance improves predictably as the data grows (Fig 1, 2)
    • However, the gains start to shrink beyond 300B tokens (the size of each improvement gets smaller)
    • Larger models needed less synthetic data to reach their best performance (I read this as "they get there faster"): the 8B model peaked at 1T tokens, while the 3B model reportedly needed up to 4T to reach similar performance (Tab 1)
    • Performance gains relative to scale can be predicted fairly accurately (Tab 1, Fig 2)
    • The synthetic data used in the experiments was built with SYNTHLLM
  • RQ2 SYNTHLLM (web-scale synthetic data generation framework): generates diverse, scalable synthetic data at large scale so that it can stand in for organic data
    1. Reference Document Filtering: train a separate classifier and use it to filter high-quality documents in a specific domain (such as math) from web data repositories like Fineweb-Edu
    2. Document-Grounded Question Generation: create questions from the key concepts in a document
      1. Lv.1: extract questions directly from the document (the most basic approach)
      2. Lv.2: extract only the key concepts from the document > combine them at random > generate questions
      3. Lv.3: build a graph connecting concepts across documents (Global Concept Graph Construction) > generate diverse concept combinations via random walk (Concept Combination Sampling) > generate questions (claimed to give the most diversity and the best scalability)
    3. Answer Generation: generate answers with an open-source LLM
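
To make the Lv.3 step more concrete, here is a minimal sketch of how a concept graph and random-walk sampling could work. It is only an illustration under my own assumptions, not the paper's implementation: the edge weighting by co-occurrence count, the function names `build_concept_graph` and `sample_concept_combination`, the toy concept lists, and the prompt wording are all hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_concept_graph(doc_concepts):
    """Link concepts that appear in the same document.

    Assumption: edge weight = number of documents in which both concepts occur.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for concepts in doc_concepts:
        for a, b in combinations(sorted(set(concepts)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def sample_concept_combination(graph, start, walk_length, rng):
    """Weighted random walk from `start`; returns the distinct concepts visited."""
    path = [start]
    current = start
    for _ in range(walk_length):
        neighbors = list(graph[current].items())
        if not neighbors:
            break
        nodes, weights = zip(*neighbors)
        current = rng.choices(nodes, weights=weights, k=1)[0]
        if current not in path:
            path.append(current)
    return path


# Toy input: concepts already extracted from three hypothetical documents.
doc_concepts = [
    ["prime numbers", "modular arithmetic", "gcd"],
    ["gcd", "Euclidean algorithm"],
    ["modular arithmetic", "congruences", "Euclidean algorithm"],
]

graph = build_concept_graph(doc_concepts)
rng = random.Random(0)
combo = sample_concept_combination(graph, "gcd", walk_length=4, rng=rng)

# The sampled combination would then be placed into a question-generation prompt.
prompt = f"Write a math problem that combines: {', '.join(combo)}"
print(prompt)
```

Because the graph is shared across documents, a walk can join concepts that never appeared together in any single document, which is roughly why this level is expected to produce more varied questions than Lv.1 or Lv.2.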

Effects

  • SYNTHLLM outperforms existing augmentation methods in (question) diversity (Fig 5) and scalability (Fig 6)
  • Models trained on synthetic data built with the proposed method were mostly the best or on par with the best, and performance did improve as the data scale grew (Tab 2, 3)
    • backbone: Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B
    • target task: Mathematical Reasoning
    • baseline datasets: OpenMathInstruct-2, MAmmoTH2, NaturalReasoning, JiuZhang 3.0, NuminaMath, etc.
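
The "predictable improvement with shrinking gains" behavior can be illustrated with the functional form of the rectified scaling law. The sketch below assumes the form L(D) = B / (D_l + D^beta) + E as I understand it from Lin et al., 2024; please check the paper for the exact definition. The parameter values are made up purely for illustration and are not fitted values from the paper.

```python
def rectified_scaling_law(d, b, d_l, beta, e):
    """Predicted loss (or error rate) after fine-tuning on `d` tokens.

    Assumed form: L(d) = b / (d_l + d**beta) + e, where d_l stands for the
    data "pre-learned" by the base model and e is the irreducible term.
    """
    return b / (d_l + d ** beta) + e


# Hypothetical parameters, chosen only to show the shape of the curve.
PARAMS = dict(b=50.0, d_l=10.0, beta=0.5, e=0.2)


def doubling_gains(sizes, **params):
    """Loss reduction obtained when moving from each data size to the next."""
    losses = [rectified_scaling_law(d, **params) for d in sizes]
    return [prev - cur for prev, cur in zip(losses, losses[1:])]


# Data sizes that double at each step.
sizes = [2 ** k for k in range(10, 20)]
gains = doubling_gains(sizes, **PARAMS)

for size, gain in zip(sizes[1:], gains):
    print(f"{size:>7} tokens: loss drops by {gain:.4f}")
```

With these toy values every doubling still helps, but each one helps less than the one before, and the curve flattens toward the irreducible term `e`.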

Personal note. Limiting the experiments to mathematical reasoning may look somewhat narrow, but I expect similar trends would show up on other QA tasks as well.