
Meta info.
  • Authors: Tong Guan, Zijie Meng, Dianqi Li, Shiyu Wang, Chao-Han Huck Yang, Qingsong Wen, Zuozhu Liu, Sabato Marco Siniscalchi, Ming Jin, Shirui Pan
  • Paper: https://arxiv.org/abs/2509.24803
  • Code: https://github.com/AntonGuan/TimeOmni-1
  • Affiliation: Griffith University, Zhejiang University, NVIDIA Research, Squirrel Ai Learning, University of Palermo, NTNU
  • Conference: ICLR 2026

TL;DR

Proposes TSR-Suite, a comprehensive benchmark for time series reasoning, and TimeOmni-1, a unified reasoning model trained with a two-stage SFT+RL recipe. Achieves +40.6% causal discovery accuracy over GPT-4.1.


Background

  • Time series data pervade real-world domains such as energy and finance, yet LLMs acquire almost no temporal priors during pretraining
  • Evolution of prior approaches:
    • TSFMs (Moirai, Time-MoE, Chronos, etc.): forecasting foundation models built on large-scale pretraining; cannot handle textual events or multiple tasks
    • TSLMs (ChatTS, Time-MQA, etc.): adapt LLMs to time series QA, but operate at the level of pattern matching without genuine reasoning
    • TSRMs (Time-R1, etc.), which recently adopted the DeepSeek-R1 paradigm, have emerged but remain limited to single-task experiments
  • Structural limitations of the existing TSQA dataset (Time-MQA) (Fig 1):
    • No need for reasoning: reasoning models gain nothing over non-reasoning models, and every model scores above 75% → the tasks are too easy (Fig 1 (a), (b))
    • Insufficient context: ambiguous options such as "high vs. low volatility" with no stated boundary force guessing rather than reasoning; even after SFT, accuracy plateaus below 65% (Fig 1 (c), (d))

Problem Statement

Two principles for designing tasks that demand genuine reasoning:

  • Principle 1 (reasoning necessity): a reasoning model (RM) should meaningfully outperform a non-reasoning model (NRM)
  • Principle 2 (context sufficiency): even with unbounded reasoning capacity, insufficient context reduces performance to random guessing

Two gaps follow from these principles:

  1. No reasoning-critical time series data satisfying both principles
  2. No validated training recipe for a general-purpose TSRM (prior approaches train a separate model per task/dataset)

Proposed Method

Problem Formulation

Time series reasoning is defined as generating an intermediate rationale R before emitting the final answer y:

\[(R, y) \sim p_\theta(R, y \mid X, C) = p_\theta(R \mid X, C) \cdot p_\theta(y \mid R, X, C)\]
  • RM: outputs in the <think> … </think> <answer> … </answer> format
  • NRM: outputs only <answer> … </answer>

This factorization makes the performance gap between the two model types directly measurable, enabling a quantitative check of Principle 1
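The RM/NRM output formats can be checked with a small parser. A minimal sketch (the helper name `parse_response` is hypothetical, not from the paper):

```python
import re

# Hypothetical helper: split a model response into the rationale R and the
# final answer y, mirroring the RM / NRM formats described above.
def parse_response(text: str):
    """Return (rationale, answer); rationale is None for NRM-style output."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if answer is None:
        raise ValueError("no <answer> block found")
    rationale = think.group(1).strip() if think else None
    return rationale, answer.group(1).strip()

rm_out = "<think>Trend rises after the event.</think> <answer>B</answer>"
nrm_out = "<answer>B</answer>"
print(parse_response(rm_out))   # ('Trend rises after the event.', 'B')
print(parse_response(nrm_out))  # (None, 'B')
```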

TSR-Suite: Time Series Reasoning Suite

Four tasks covering three levels of cognitive ability, perception → extrapolation → decision-making (Fig 2):

  • Task 1 [Scenario Understanding]: Perception / Multi-domain / Multi-choice
  • Task 2 [Causality Discovery]: Perception / River discharge (CausalRivers) domain / Multi-choice
    • Reasoning flow: trend consistency → key fluctuation alignment → causal direction (via the domain rule "small rivers flow into big rivers")
  • Task 3 [Event-aware Forecasting]: Extrapolation / Human mobility and electricity load domains / Sequence output
  • Task 4 [Decision Making]: Decision-making / Building energy (CityLearn) domain / Multi-choice

Hierarchical CoT annotation pipeline (Fig 3):

LLM Analyzer (using structured templates) → Human Reviewer (checks context sufficiency) → LLM Rewriter

  • Samples that both the LLM and the human fail on are discarded
  • Task 3 caveat: chains generated with ground-truth hints actually hurt SFT performance
    • Per Tab 6, ID MAE 24.53 (with hints) vs. 15.10 (LLM self-generated)
    • Consistent with the curriculum learning principle: data slightly harder than the model's current ability is optimal

TimeOmni-1: Two-Stage Training

Stage 1 — SFT to inject temporal priors

CoT-SFT on all four tasks using the hierarchical CoT data

  • Finding #1: fewer than 1K seed samples lift Task 2 accuracy by +46.1% (the base model collapses to 21.6%, below the 33.3% random-guess baseline)
  • Finding #2: human-designed structured templates are key — GPT-4.1 zero-shot on Task 2: 28.7% → 71.1% with the template applied (Fig 5)

Stage 2 — RL (GRPO) to refine reasoning

Task-specific outcome-based reward design:

  • $R_{\text{format}}$: adherence to the <think> … </think> <answer> … </answer> format
  • $R_{\text{discrete}} \in \{0, 1\}$: exact match on Tasks 1/2/4
  • $R_{\text{count}} = 0.1$: bonus for matching the output sequence length on Task 3 (the Stage 1 checkpoint got the length right only 55.7% of the time)
  • Task 3 MAE is mapped to a normalized reward range via exponential decay

  • Finding #3: RL alone without Stage 1 yields marginal gains or even degradation (Task 4: -5.3%) → the Stage 1 prior is a prerequisite (Fig 6)
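The reward components above can be sketched as follows. The decay scale and the combination of terms are assumptions for illustration; the paper specifies only the reward types, not exact coefficients:

```python
import math
import re

def format_reward(text: str) -> float:
    """1.0 if the response follows <think>...</think> <answer>...</answer>."""
    ok = re.fullmatch(r"\s*<think>.*</think>\s*<answer>.*</answer>\s*",
                      text, re.DOTALL)
    return 1.0 if ok else 0.0

def discrete_reward(pred: str, gold: str) -> float:
    """Exact-match reward for the multiple-choice tasks (Tasks 1/2/4)."""
    return 1.0 if pred.strip() == gold.strip() else 0.0

def forecast_reward(pred: list, gold: list, scale: float = 10.0) -> float:
    """Task 3: length-match bonus plus an exponentially decayed MAE term."""
    count_bonus = 0.1 if len(pred) == len(gold) else 0.0
    n = min(len(pred), len(gold))
    if n == 0:
        return count_bonus
    mae = sum(abs(p - g) for p, g in zip(pred, gold)) / n
    return count_bonus + math.exp(-mae / scale)  # MAE term lies in (0, 1]
```

Exponential decay keeps the forecasting reward bounded and dense: a perfect forecast earns 1.0 plus the length bonus, and reward shrinks smoothly as MAE grows.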

Joint Training

Training on all four tasks jointly demonstrates the "train-once, use-across-tasks" paradigm (Fig 7):

  • Zero-shot capability transfer: decision-making ACC 25.5% → 26.2% → 31.3% as perception/extrapolation priors are added in sequence (Fig 7(a))
  • Supervised capability supplement: 40.9% → 45.7% → 47.9% (Fig 7(b))
  • Finding #4 holds against the existing single-task pipeline (TimeMaster: six separate models for six datasets) (Fig 7(c))

Experiments

Experimental Setup

  • Base Model: Qwen2.5-7B-Instruct
  • Time-series input: serialize the values as text
    • No general-purpose time series encoder analogous to ViT exists yet; same approach as Time-R1 and Time-MQA
  • External Benchmarks:
    • MTBench: real stock and weather time series, QA over time ranges
    • TimeSeriesExam: five tasks on synthetic time series (synthetic data chosen for unambiguous ground truth)
    • CaTS-Bench: measures time-series↔natural-language alignment, retrieval-style tasks
    • DROP / GPQA / ReClor: numerical reasoning / graduate-level expert knowledge / logical reasoning benchmarks
  • Metrics: all metrics are computed only over valid responses
    • Success Rate (SR): fraction of valid responses — reported separately because specialized models frequently fail to follow the output format (ChatTS: 0% SR on Task 3)
    • ACC (Tasks 1/2/4): exact match
    • MAE (Task 3): lower is better
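The text serialization of time series input can be sketched as below. The exact precision and delimiter are assumptions; the paper states only that values are serialized as text, as in Time-R1 / Time-MQA:

```python
def serialize(values: list, precision: int = 2) -> str:
    """Render a numeric sequence as a comma-separated string for the prompt."""
    return ", ".join(f"{v:.{precision}f}" for v in values)

def deserialize(text: str) -> list:
    """Parse a model-emitted numeric sequence back for MAE scoring."""
    return [float(tok) for tok in text.split(",") if tok.strip()]

prompt_part = serialize([101.0, 103.5, 99.25])
print(prompt_part)               # 101.00, 103.50, 99.25
print(deserialize(prompt_part))  # [101.0, 103.5, 99.25]
```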

Results

Main Table (Tab 1)

  • Causal discovery accuracy over GPT-4.1: +40.6% (ID) / +28.1% (OOD)
  • The low SR of existing TS-specialized models stands out: ChatTS gets 0% SR on Task 3 (it emits free-form text instead of a numeric sequence)
  • Task 3 OOD MAE (145.53) vs. ID MAE (14.30): the gap under the NYC taxi → electricity load domain shift remains large

Ablation over training stages (Tab 2)

  • Task 2, ANS-SFT vs. CoT-SFT: 30.5% vs. 67.7%
    • Answer-only supervision fits the answer distribution without cultivating reasoning ability
  • CoT-SFT + RL: the most balanced performance across all tasks

Preservation of general reasoning ability (Fig 8)

  • Average +16.5% over the base model on DROP (numerical reasoning), GPQA (graduate level), and ReClor (logical reasoning)
  • Confirms that time series specialization does not harm general reasoning ability

Personal note. I think the real contribution of this paper is its attempt to rigorously define what it means for a task to demand genuine reasoning. It may be a stretch, but time series reasoning and personalized tool-calling struck me as superficially very different problems sharing the same core structure: infer latent patterns from an observation history and carry them into future actions. Task 4 (Decision Making) in the proposed TSR-Suite, which reasons over past time series patterns to choose the optimal strategy, may well be isomorphic to Preference Inference/Transfer, which infers latent preferences from past session history to decide API arguments.