4 minute read

Meta info.
  • Authors: Zhi Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Lan-Zhe Guo, Yu-Feng Li, Xiaoxing Ma
  • Paper: https://arxiv.org/abs/2510.15444
  • Affiliation: Nanjing University, ETH Zurich
  • Conference: NeurIPS 2025
  • Published: October 17, 2025

TL;DR

Proposes the first theoretical framework for sampling-based test-time scaling, analyzes the limitations of SC and PPL, and introduces RPC, which combines the strengths of both methods.

[Figures — 1: RPC overview · 2: Accuracy vs. sampling budget · 3: Reliability diagrams · 4: Code generation results · 5: RPC method details]

Background

  • Test-time scaling: an approach that spends extra computation at inference time to improve reasoning performance
    • sampling-based: generate N reasoning paths and pick the best answer Best-of-N style via confidence estimation
    • LLM sampling is stochastic, so the same input produces different outputs each time → confidence estimation is the key
  • Two representative approaches to confidence estimation
    • Self-Consistency (SC): sample n paths and estimate confidence via majority vote
      • no log probabilities needed → applicable to both open- and closed-source models
      • aggregates equivalent paths at the answer level → low model error
      • drawback: error shrinks only in inverse proportion to the sample count (linear convergence) → slow when samples are few
    • Perplexity (PPL): use the LLM's internal token probabilities directly as confidence
      • requires log probabilities → open-source only
      • evaluates each path independently → high model error
      • error converges exponentially fast, but degrades to linear on hard problems where probabilities are near zero
  • Both methods work well in practice, but there was no theoretical account of why they work or how to improve them

Problem Statement

  • SC's problem: error decreases only linearly in the number of samples → hard to reach sufficient performance under a limited sampling budget
  • PPL's problems
    • two paths expressing the same answer differently receive different scores → large model error due to the lack of aggregation
    • a degeneration issue: the harder the problem (the smaller the probabilities), the more the exponential-convergence advantage disappears
  • Research Question: can we achieve PPL's fast convergence and SC's low model error at the same time?

Proposed Approach

์ด๋ก ์  ํ”„๋ ˆ์ž„์›Œํฌ: Reasoning Error Decomposition

  • ํ•ต์‹ฌ ์•„์ด๋””์–ด: reasoning error๋ฅผ ๋‘ ๋…๋ฆฝ์ ์ธ ํ•ญ์œผ๋กœ ๋ถ„ํ•ด
    • Estimation Error: ์ถ”์ •๋œ confidence์™€ ์‹ค์ œ confidence ์‚ฌ์ด์˜ ์ฐจ์ด โ†’ ์ƒ˜ํ”Œ ์ˆ˜์™€ estimation ์ „๋žต์œผ๋กœ ์ œ์–ด ๊ฐ€๋Šฅ
    • Model Error: ์‹ค์ œ confidence์™€ ์ •๋‹ต ์—ฌ๋ถ€ ์‚ฌ์ด์˜ ์ฐจ์ด โ†’ LLM ์ž์ฒด ๋Šฅ๋ ฅ์— ์˜์กด, ์ƒ˜ํ”Œ ์ˆ˜์™€ ๋ฌด๊ด€
  • SC ๋ถ„์„ (Proposition 2)
    • Estimation Error = Bernoulli ๋ถ„์‚ฐ โ†’ linear convergence: ์ƒ˜ํ”Œ์„ ๋‘ ๋ฐฐ ์จ์•ผ error๊ฐ€ ์ ˆ๋ฐ˜
    • Model Error = answer-level aggregation ๋•๋ถ„์— ๋‚ฎ๊ฒŒ ์œ ์ง€
  • PPL ๋ถ„์„ (Proposition 3)
    • Estimation Error = exponential convergence: ์ƒ˜ํ”Œ์ด ์กฐ๊ธˆ๋งŒ ๋Š˜์–ด๋„ error๊ฐ€ ๋น ๋ฅด๊ฒŒ ์ค„์–ด๋“ฆ
    • ๋‹จ, path probability๊ฐ€ 0์— ๊ฐ€๊นŒ์šธ์ˆ˜๋ก exponential ์ด์ ์ด ์‚ฌ๋ผ์ง€๊ณ  linear ์ˆ˜์ค€์œผ๋กœ degrade
    • Model Error = path ๋‹จ์œ„ ํ‰๊ฐ€๋กœ SC๋ณด๋‹ค ๋†’์Œ (์ด๋ก ์ ์œผ๋กœ ์ฆ๋ช…)
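Schematically, the decomposition can be written as follows (generic notation for illustration only; the paper's exact definitions and symbols may differ):

```latex
% Total reasoning error splits into two independent terms, where \hat{c}_n is
% the confidence estimated from n samples, c is the true confidence, and
% y^\ast is the correct answer:
\underbrace{\mathbb{E}\bigl[\,\lvert \hat{c}_n(\hat{y}) - c(\hat{y}) \rvert\,\bigr]}_{\text{estimation error: shrinks as } n \to \infty}
\;+\;
\underbrace{\mathbb{E}\bigl[\,\lvert c(\hat{y}) - \mathbb{1}[\hat{y} = y^\ast] \rvert\,\bigr]}_{\text{model error: independent of } n}
```

The first term is what SC and PPL trade off in convergence speed; the second is what answer-level aggregation keeps small.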

  Method | Estimation error convergence | Model error | Key problem
  SC     | linear                       | low         | needs many samples
  PPL    | exponential (can degrade)    | high        | degrades on hard problems
  RPC    | exponential                  | low         | —

Method: RPC (Reasoning-pruning Perplexity Consistency)

  • RPC: ๋‘ ๊ฐœ์˜ sequential component๋กœ ๊ตฌ์„ฑ๋œ post-hoc confidence estimation ๋ฐฉ๋ฒ•

Component 1: Perplexity Consistency (PC)

  • ์•„์ด๋””์–ด: PPL์ฒ˜๋Ÿผ ๋‚ด๋ถ€ ํ™•๋ฅ ์„ ์“ฐ๋˜, SC์ฒ˜๋Ÿผ ๊ฐ™์€ ๋‹ต์„ ๋‚ด๋Š” path๋“ค์˜ ํ™•๋ฅ ์„ ํ•ฉ์‚ฐ
    • ๊ฐ™์€ ๋‹ต ลท๋ฅผ ์ƒ์„ฑํ•œ ๋ชจ๋“  sampled path์˜ ํ™•๋ฅ ์„ ๋”ํ•ด ํ•ด๋‹น ๋‹ต์˜ confidence๋กœ ์‚ฌ์šฉ
    • Confidence(ลท) = ฮฃ p(tฬƒ|x) for all retained paths where g(tฬƒ) = ลท
  • ํšจ๊ณผ (Theorem 4)
    • Estimation Error: SC์ฒ˜๋Ÿผ answer-level aggregation์„ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ PPL์ฒ˜๋Ÿผ exponential convergence ๋‹ฌ์„ฑ
    • Model Error: SC์™€ ๋™์ผ ์ˆ˜์ค€์œผ๋กœ ๋‚ฎ๊ฒŒ ์œ ์ง€
    • ๋‹จ, path probability๊ฐ€ ๊ทน๋„๋กœ ๋‚ฎ์€ ๊ฒฝ์šฐ ์—ฌ์ „ํžˆ degeneration ๋ฐœ์ƒ ๊ฐ€๋Šฅ โ†’ RP๋กœ ํ•ด๊ฒฐ

Component 2: Reasoning Pruning (RP)

  • ์•„์ด๋””์–ด: ๋ชจ๋ธ ์Šค์Šค๋กœ near-zero probability๋ฅผ ๋ถ€์—ฌํ•œ path๋Š” ์ •๋‹ต์ผ ๊ฐ€๋Šฅ์„ฑ์ด ๋‚ฎ์œผ๋ฏ€๋กœ, PC ์‹คํ–‰ ์ „์— ๋ฏธ๋ฆฌ ์ œ๊ฑฐ
  • ์ž๋™ threshold ๊ฒฐ์ •: sampled path๋“ค์˜ probability ๋ถ„ํฌ๋ฅผ 2-component Weibull mixture๋กœ ๋ชจ๋ธ๋ง
    • ๋ถ„ํฌ๋ฅผ high-probability ์˜์—ญ๊ณผ low-probability ์˜์—ญ์œผ๋กœ ์ž๋™ ๋ถ„๋ฆฌ
    • P_High < 0.5์ด๋ฉด์„œ ์ „์ฒด mean๋ณด๋‹ค ๋‚ฎ์€ probability๋ฅผ ๊ฐ€์ง„ path๋ฅผ pruning
    • threshold๋ฅผ ์ˆ˜๋™์œผ๋กœ ์„ค์ •ํ•  ํ•„์š” ์—†๋Š” hyperparameter-free ๋ฐฉ์‹
  • ํšจ๊ณผ (Theorem 7): optimal threshold ์‚ฌ์šฉ ์‹œ ๋†’์€ ํ™•๋ฅ ๋กœ optimal error reduction ๋‹ฌ์„ฑ ๋ณด์žฅ
    • noise path ์ œ๊ฑฐ โ†’ PC์˜ degeneration ๋ฌธ์ œ ํ•ด์†Œ
    • incorrect path ์ค‘ low-probability์ธ ๊ฒƒ๋“ค์ด ์ œ๊ฑฐ๋˜๋ฉด์„œ model error๋„ ํ•จ๊ป˜ ๊ฐ์†Œ
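A deliberately simplified stand-in for RP (the paper fits a 2-component Weibull mixture; here a hard 1-D two-means split plays the role of the mixture's high/low components, and all probability values are invented):

```python
def two_means(xs, iters=50):
    """Crude 1-D k-means (k=2) as a stand-in for the Weibull mixture fit."""
    lo, hi = min(xs), max(xs)
    for _ in range(iters):
        in_hi = [abs(x - lo) > abs(x - hi) for x in xs]  # True -> closer to high center
        lo_pts = [x for x, h in zip(xs, in_hi) if not h]
        hi_pts = [x for x, h in zip(xs, in_hi) if h]
        if lo_pts:
            lo = sum(lo_pts) / len(lo_pts)
        if hi_pts:
            hi = sum(hi_pts) / len(hi_pts)
    return lo, hi

def reasoning_pruning(probs):
    """Prune paths whose high-component membership is < 0.5 AND whose
    probability is below the overall mean (the paper's pruning rule)."""
    mean = sum(probs) / len(probs)
    lo, hi = two_means(probs)
    kept = []
    for p in probs:
        p_high = 1.0 if abs(p - hi) < abs(p - lo) else 0.0  # hard membership
        if p_high < 0.5 and p < mean:
            continue  # model itself assigned near-zero probability -> drop
        kept.append(p)
    return kept

print(reasoning_pruning([0.30, 0.25, 0.28, 0.001, 0.002]))  # near-zero paths removed
```

The hard cluster membership here is a simplification; the actual method uses the fitted mixture's posterior probability, which gives a soft, automatically calibrated threshold.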

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜

  • Phase 1 (RP): fit the Weibull mixture → remove low-probability paths
  • Phase 2 (PC): sum probabilities per answer over the remaining paths → select the answer with the highest confidence
  • Extra overhead: on MathOdyssey with 128 samples, SC 0.006 s/question → RPC 0.036 s/question (negligible relative to LLM inference)

Experiments

  • Experimental Setup
    • Models: InternLM2-Math-Plus 1.8B/7B, DeepSeekMath-RL 7B, DeepSeek-Coder 33B, DeepSeek-R1-Distill-Qwen-7B
    • Datasets: MATH, MathOdyssey, OlympiadBench, AIME (4 math), HumanEval, MBPP, APPS (3 code), GPQA, LogiQA
    • Baselines: PPL, SC, Verbalized Confidence (VERB)
    • Metrics: Accuracy โ†‘, ECE (Expected Calibration Error) โ†“, sampling budget โ†“
    • ๊ฐ ์‹คํ—˜ 10 random seed๋กœ ๋ฐ˜๋ณต, A800/H800 GPU
  • Results
    • RQ1 (Efficiency): Table 1 — comparison of the minimum number of samples needed to match SC's best performance
      • RPC matches or exceeds SC on all 4 math benchmarks while using 50–71% fewer samples
      • MathOdyssey: 112 → 32 samples (-71.4%), the largest reduction
      • PC alone fails to reduce samples on some datasets due to degeneration; adding RP yields consistent improvements across all datasets
    • RQ2 (Efficacy): Figure 2 — accuracy curves as a function of the number of samples
      • consistent ordering RPC > PC > SC > PPL across all sampling budgets
      • PPL plateaus early due to its high model error; RPC avoids this and also reaches a higher accuracy ceiling
      • Table 2 (InternLM2-Math-Plus 7B) averages: RPC 26.11% vs. SC 24.82%
      • Table 3 confirms the same trend on the 1.8B model and DeepSeekMath-RL 7B
    • RQ3 (Reliability): Table 2 — joint comparison of Accuracy and ECE
      • PPL: average ECE 73.14 → completely miscalibrated
      • VERB: worst in both accuracy and ECE
      • SC: ECE 13.37, reasonable but short of RPC
      • RPC: accuracy 26.11% + ECE 12.37 → best in both accuracy and calibration
      • Figure 3 (reliability diagrams): RPC's predicted confidence aligns far better with actual accuracy
    • Additional Results
      • Figure 4: outperforms all baselines on the 3 code-generation benchmarks (HumanEval, MBPP, APPS) as well
      • RPC remains effective on an R1-style thinking model, DeepSeek-R1-Distill-Qwen-7B (Table 5)
      • consistent gains when combined with advanced baselines such as ESC and BoN + reward model (Tables 6, 7)
      • RPC stays robust at high sampling temperatures (1.1, 1.3), whereas SC degrades there due to increased estimation error

Personal note. The core contribution of this paper seems to be that it gives the first answer to what is theoretically different between these methods: once everything is organized along the two axes of Estimation Error (convergence speed) vs. Model Error (aggregation strategy), the previously hazy concepts behind confidence estimation research become considerably clearer. That said, as noted above, beyond the well-developed theoretical argument, the proposed method itself did not leave a particularly strong impression on me.