Meta info.
  • Authors: Chaojun Xiao, Zhengyan Zhang, Chenyang Song, Dazhi Jiang, Feng Yao, Xu Han, Xiaozhi Wang, Shuo Wang, Yufei Huang, Guanyu Lin, Yingfa Chen, Weilin Zhao, Yuge Tu, Zexuan Zhong, Ao Zhang, Chenglei Si, Khai Hao Moo, Chenyang Zhao, Huimin Chen, Yankai Lin, Zhiyuan Liu, Jingbo Shang, Maosong Sun
  • Paper: https://arxiv.org/pdf/2409.02877
  • Affiliation: CMU, ModelBest Inc., NUS, Princeton Univ., Renmin Univ., Stanford Univ., Tsinghua Univ., UCLA, University of California San Diego
  • Published: September 4, 2024

TL; DR

The paper proposes viewing an LLM as a set of functional modules, much like the human brain (decomposing it into "bricks"), and reports empirical results that support this view.

Configurable Foundation Models: Building LLMs from a Modular Perspective

Problem Statement

As LLMs keep growing, cost becomes a bigger problem. Adaptation also gets relatively harder and transparency drops. → Could an LLM be split into separate parts, like the human brain?

Suggestion

The paper introduces a perspective that decomposes the model into bricks.

  • Emergent Bricks: form naturally during pretraining.
    • Activation Sparsity: just as humans use only a specific part of the brain for a given function, an LLM activates only specific neurons depending on the input. (Experiment 1)
    • Function Localization: LLM neurons specialize in particular functions (translation, coding, sentiment analysis, ...). (Experiment 2)
    • Human-Defined Emergent Bricks: units chosen by people. For example, each head of a Transformer multi-head attention layer, or each neuron of an FFN, can be treated as one brick.
    • Self-Organized Emergent Bricks: neurons that cluster spontaneously during training based on their activation patterns. Each cluster performs a specific function. (Experiment 3)
  • Customized Bricks: bricks deliberately designed or added during post-training to inject knowledge into the model.
    • Task Bricks: Adapter, Prompt, Prefix Tuning, LoRA, …
    • Knowledge Bricks: KG, external context?
    • Modality Bricks: modules designed separately to handle multiple modalities.
  • How bricks are used
    • Retrieval & routing: select the bricks that suit a given input.
    • Combination: combine bricks, e.g. a linear combination of bricks that share the same structure, or stitching bricks in some order to perform complex reasoning (the order can be heuristic or chosen by a separate planner model).
    • Updating: update only the existing bricks that underperform, or add new bricks separately.
    • Growing: expansion, either by scaling up the model and training on more data from pretraining onward, or through post-training.
  • Granularity: the unit can be a single neuron, a cluster of neurons, a layer, or an entire model, depending on how a brick is defined.
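
To make "retrieval & routing" and "combination" concrete, here is a minimal sketch in plain Python. It is an illustration under my own assumptions, not the paper's method: each brick is assumed to have a key vector describing what it is good at, routing is assumed to be cosine similarity against a query vector, and combination is a weighted sum of same-length parameter vectors. The names `route`, `combine`, and `brick_keys` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def route(query, brick_keys, top_k=1):
    """Return the names of the top_k bricks whose key vector best matches the query.

    brick_keys is a hypothetical mapping {brick name: key vector}.
    """
    ranked = sorted(brick_keys, key=lambda name: cosine(query, brick_keys[name]), reverse=True)
    return ranked[:top_k]


def combine(bricks, weights):
    """Linearly combine bricks that share the same structure (same-length parameter lists)."""
    length = len(bricks[0])
    if any(len(b) != length for b in bricks):
        raise ValueError("all bricks must have the same structure")
    return [sum(w * b[i] for w, b in zip(weights, bricks)) for i in range(length)]


# Toy usage: two bricks with 2-dimensional keys and parameters.
brick_keys = {"coding": [1.0, 0.0], "translation": [0.0, 1.0]}
selected = route([0.9, 0.1], brick_keys)
merged = combine([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5])
```

In a real system the parameter lists would be something like adapter or LoRA weight tensors, and the router could be learned instead of a fixed similarity rule.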

Effects

  • Experimental setup: the proposed perspective is checked empirically on Llama-3-8B-Instruct and Mistral-7B-Instruct-v0.3.
  • Result:
    • Experiment 1: mask neurons with low activation values → measure the change in model performance (Figure 6, 7)
      • The models do show high activation sparsity: only a small number of neurons are activated for a given input.
      • Masking neurons with low activation values has little effect on model performance.
    • Experiment 2: use the Infinity-Instruct dataset to analyze how LLM neuron activations correlate with downstream task functions (Figure 8)
      • Each function has its own set of neurons with a high Functionality Score.
      • By layer, the lower layers are activated for most functions, while the higher the layer, the more its activation is tied to a specific function.
    • Experiment 3: from the previous experiment, group the 50 highest-scoring neurons for each function → measure the similarity between neuron groups
      • Groups for different functions have low similarity.
      • (This arguably suggests that the functions performed by individual neurons in an LLM are clearly separated.)
      • Figure 9, perturbation study: change in model PPL when the neurons for a specific function are removed
        • Diagonal: after the neurons for a function are removed, performance on that same function drops sharply.

Personal note. Some of this is already under way and other parts may still sound like pie in the sky, but the paper argues for a macro-level perspective. How the functions are divided, how the neurons are grouped, and how the similarity between groups is checked are all empirical choices. Still, setting aside whether the claims are right or wrong, I think it matters how wide a view one takes of the trend, so I am sharing it. I left it out for length, but the Discussion in the later part of the paper also picks out some interesting points that seem worth keeping in mind when proposing new research.