Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling
Key Summary
- Timer-S1 is a huge time-series model (8.3B parameters, only 0.75B used per step) that predicts the future by thinking step-by-step inside one forward pass.
- Its key idea, Serial-Token Prediction (STP), adds the missing serial computations that long-term forecasting needs without slow rolling autoregression.
- Timer-S1 mixes two special block types: TimeMoE (a sparse Mixture-of-Experts for diverse patterns) and TimeSTP (serial blocks that pass and refine intermediate steps).
- It learns from TimeBench, a curated dataset with over one trillion time points plus smart augmentations (resampling, value-flipping) to reduce bias.
- A two-stage pipeline boosts results: pre-training with dense supervision across horizons, then post-training that prioritizes short-term accuracy and extends context to 11.5K.
- On the large GIFT-Eval benchmark, Timer-S1 achieves state-of-the-art results (MASE 0.693, CRPS 0.485), especially shining on medium and long horizons.
- Serial Scaling (architecture + data + training) lets the model get better as horizons grow, without the usual error pile-up.
- Compared to next-token and multi-token training, STP delivers stronger accuracy and faster multi-step inference.
- Ablations show each piece matters: keeping TimeSTP at inference, using augmentation, and pre-training on TimeBench all improve performance.
- Limitations include handling extra outside signals (covariates) and fully mastering multivariate structure, but the approach opens clear paths forward.
Why This Research Matters
Better forecasts touch everyday life: steadier power for homes, smoother hospital staffing, fewer late deliveries, and safer city traffic. Timer-S1 shows we don’t have to choose between accuracy and speed for long-range predictions—we can have both by computing the serial chain inside the model once. Its uncertainty-aware outputs help decision-makers plan for best, typical, and worst cases. Because it’s a foundation model trained on a vast and diverse dataset, it can work out-of-the-box across many domains. This raises the baseline for what organizations can expect from forecasting systems. And by revealing a clear path to scaling (architecture, data, and training together), it sets a template others can build on.
Detailed Explanation
01 Background & Problem Definition
You know how predicting tomorrow’s weather or next week’s traffic needs you to use what happened just before? Long-range guesses depend on shorter-range guesses being right. That “one step leads to the next” chain makes time forecasting special. Before this paper, AI models for time series had grown better but hit a wall when trying to scale up and still stay accurate over many future steps. Many models either predicted all future steps at once (fast but often missed the step-by-step logic) or predicted one step at a time (faithful to the chain, but slow and error-prone as mistakes snowballed).
What was the world like before? Time-series models came from classic stats (like ARIMA), then machine learning (like trees), then deep learning (like RNNs and Transformers). Recently, “foundation models” promised one model for many datasets, trained once and used anywhere. But time series brought extra twists: different domains behave very differently, signals have many speeds (daily, weekly, seasonal), and the future can change suddenly. That variety made it hard to build one big model that scales cleanly.
The specific problem: Scaling. When models got bigger, they didn’t automatically get better for long horizons. Autoregressive models predicted step-by-step like real life, but for long forecasts they had to “roll” many times—slow and prone to accumulating errors. Parallel models predicted many steps at once—fast!—but they often skipped the important serial reasoning (what happens at step 2 depends on step 1’s outcome), so accuracy suffered over long ranges.
What did people try? 1) Next-token prediction (just the next piece, then roll again). It fit the serial nature but needed lots of repeated inference and could compound small mistakes. 2) Multi-token prediction (predict several future pieces at once). Faster, but it weakened the serial chain and often struggled with long horizons. 3) Borrowed LLM tricks like caching to speed up rolling; helpful, but time series’ noisiness still made error buildup a big deal. 4) Early MoE models tried scaling with experts, but without a serial-aware training and inference plan, benefits were limited.
The gap: A way to keep the serial logic that long-term forecasting truly needs, yet avoid the slow, error-prone rolling. Also, a way to scale data and training that respects the different needs of short vs. long horizons.
Why should anyone care? Forecasts guide real decisions: how much energy to generate tonight, which medicines to reorder this week, when a machine might fail, how much money to set aside for next quarter. If your model ignores the serial chain, long-range plans wobble. If it rolls step-by-step too slowly or compounds errors, results come late or off-target. A method that is both serial-smart and efficient means better schedules, safer hospitals, steadier power grids, and lower costs in daily life.
New concept 1 — Mixture-of-Experts (MoE) 🍞 Hook: Imagine a school where each teacher is a specialist—one for math, one for art, one for music. You don’t ask every teacher every question; you go to the right expert. 🥬 The Concept: MoE is a model where different “experts” handle different kinds of inputs, and a gate picks the best ones for each input.
- How it works:
- A router looks at the input patch and scores experts.
- Only a few top experts are activated.
- Their outputs are combined into the final answer.
- Why it matters: Without MoE, one giant “generalist” has to learn everything, which is hard for diverse time series. MoE scales the model while keeping computation efficient. 🍞 Anchor: A weather patch with strong daily patterns goes to a “seasonality expert”; a sudden spike from a sensor goes to an “anomaly expert”.
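The routing idea above can be sketched in a few lines. This is a toy illustration, not Timer-S1's actual code: the embedding size, expert count, and linear "experts" are made up, but the pattern (score all experts, run only the top-k, mix their outputs by softmaxed scores) is the standard sparse-MoE recipe.

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Sparse MoE for one token embedding x: score experts,
    keep the top-k, and mix only those experts' outputs."""
    scores = router_w @ x                      # one score per expert
    top = np.argsort(scores)[-k:]              # indices of the k best experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                   # softmax over the chosen experts
    # Only the selected experts run; the rest are skipped entirely.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
router_w = rng.normal(size=(n_experts, d))
# Each "expert" is a tiny linear map here, standing in for a real FFN expert.
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, M=M: M @ x for M in expert_mats]

token = rng.normal(size=d)
out = moe_forward(token, router_w, experts, k=2)
print(out.shape)  # (8,)
```

Note how compute stays flat as you add experts: with k=2, doubling `n_experts` doubles capacity but not per-token work.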
New concept 2 — Serial Scaling 🍞 Hook: You know how you line up dominoes so each one tips the next in order? The sequence matters. 🥬 The Concept: Serial Scaling means building the model and training so it explicitly computes step-by-step reasoning across future steps.
- How it works:
- Add blocks that refine predictions in order of horizons.
- Keep those blocks at inference (don’t throw them away).
- Scale data and training to support this ordered reasoning.
- Why it matters: Without it, long-term predictions miss the “what came just before?” logic and drift off target. 🍞 Anchor: Planning a road trip: you decide today’s drive, then tomorrow’s based on where you ended today—not all days at once.
New concept 3 — Serial-Token Prediction (STP) 🍞 Hook: Picture guessing the next frames of a cartoon. The best guess for frame 3 depends on what happened in frames 1 and 2. 🥬 The Concept: STP trains the model to produce future tokens in a serial chain within one forward pass, refining each step as it goes.
- How it works:
- The model first encodes the past.
- A special series of blocks refines predictions horizon by horizon.
- Each deeper block handles a farther horizon, reusing previous steps.
- Why it matters: Without STP, you either roll slowly step-by-step or guess everything at once and lose serial logic. STP keeps the chain and stays efficient. 🍞 Anchor: Building a tower of blocks: place the first block, check it’s stable, then place the second, and so on—faster than rebuilding from scratch each time.
02 Core Idea
The Aha! Moment: Let the model do the step-by-step thinking for future steps inside the network—and keep those steps alive during inference—so long-term forecasts are both faithful to the serial nature and fast.
Analogy 1 — Assembly line 🍞 Hook: Think of a factory where a toy moves down a line and gets improved at each station. 🥬 The Concept: Each TimeSTP block is a station that adds more detail for a farther future step.
- How it works:
- Past is embedded.
- The first TimeSTP block makes a near-future guess.
- The next block refines for a farther step, reusing earlier info.
- Why it matters: Skipping stations means missing needed work; stopping the line to redo everything (rolling) is slow. 🍞 Anchor: A bike gets frame, wheels, and paint at different stations. Similarly, the forecast for step 3 builds on what steps 1 and 2 already handled.
Analogy 2 — Relay race 🍞 Hook: In a relay, runners hand off the baton smoothly so each leg benefits from the last. 🥬 The Concept: TimeSTP blocks pass refined context forward so each horizon starts from a better place.
- How it works:
- Runner 1 (near-term) sprints.
- Hands baton (refined embeddings) to Runner 2 (mid-term).
- And so on.
- Why it matters: Without handoffs, each runner starts from scratch and the team slows. 🍞 Anchor: Clean handoffs mean the team finishes faster and steadier; clean embedding handoffs mean better long-term forecasts.
Analogy 3 — Onion layers 🍞 Hook: Peeling an onion reveals deeper layers step by step. 🥬 The Concept: Deeper TimeSTP blocks reveal farther-future structure gradually.
- How it works:
- Start with surface patterns (short-term).
- Peel deeper (medium-term).
- Peel deepest (long-term).
- Why it matters: Jumping straight to the core skips necessary context and leads to wrong shapes. 🍞 Anchor: You can’t see the inner rings without peeling outer ones; you can’t reliably predict far-out steps without computing through nearer ones.
Before vs After:
- Before: Big time-series models either rolled step-by-step (slow, error can snowball) or predicted many steps at once (fast, but missed serial dependencies). Scaling hit a ceiling.
- After: Timer-S1 adds TimeSTP blocks for STP, so the model computes serially inside one pass. It respects the chain while avoiding repeated rolling, unlocking better long-horizon accuracy and speed.
Why it works (intuition):
- Long-term futures depend on the results of nearer futures. By organizing blocks so deeper ones focus on farther horizons and reuse prior steps, the model bakes the chain of reasoning into its structure. Keeping these blocks during inference avoids the train–test mismatch and error blow-ups seen when auxiliary structures are thrown away.
Building blocks (introduced with sandwich explanations):
New concept 4 — TimeMoE 🍞 Hook: Like sending tricky math problems to a math whiz and music questions to a music whiz. 🥬 The Concept: TimeMoE is a sparse MoE tailored for time series patches so the right experts process the right patterns.
- How it works:
- A router scores experts for each input patch.
- Only a few experts run (fast!).
- Their outputs are combined for a strong, specialized result.
- Why it matters: Time series vary wildly by domain; routing to specialists boosts accuracy and efficiency. 🍞 Anchor: A financial patch with volatility goes to a “volatility expert,” while a calm climate patch goes to a “seasonality expert.”
New concept 5 — TimeSTP blocks 🍞 Hook: Imagine stepping stones across a river: each stone helps you reach the next. 🥬 The Concept: TimeSTP blocks are serial stations that refine embeddings step-by-step and output farther-horizon predictions.
- How it works:
- Take the last block’s embeddings plus the original input’s memory.
- Fuse them and compute a refined state.
- Project a farther-horizon prediction; pass the refined state onward.
- Why it matters: Without these stones, you either jump across (risky, inaccurate) or inch forward very slowly. 🍞 Anchor: To reach the 4th stone, you must stand on the 3rd; to reach horizon 4, you compute through horizons 1–3.
New concept 6 — TimeBench dataset 🍞 Hook: A giant library of stories lets you learn many plots; a small shelf doesn’t. 🥬 The Concept: TimeBench is a curated trillion-point time-series corpus from many domains, cleaned and balanced for training.
- How it works:
- Gather diverse real and synthetic series.
- Clean and assess quality (impute, de-noise, test predictability).
- Sample in a way that prevents leakage and encourages variety.
- Why it matters: Without a broad, clean library, a foundation model won’t generalize. 🍞 Anchor: Learning both calm and stormy weather stories helps the model forecast new places’ skies.
New concept 7 — Data augmentation 🍞 Hook: Photographers edit the same picture (crop, rotate) to teach a model robustness; we can do similar tricks for time. 🥬 The Concept: Resampling and value-flipping create new yet truthful versions of the same series to fight bias.
- How it works:
- Resampling exposes different time resolutions.
- Value-flipping turns upward trends into downward twins.
- Train on both so the model doesn’t assume one direction or frequency.
- Why it matters: Without this, models can overfit to common frequencies or one-way trends. 🍞 Anchor: If you only ever see rising stock prices, you might wrongly expect all stocks to rise forever.
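Both tricks are simple enough to show concretely. This is a minimal sketch with made-up data; the paper's actual augmentation pipeline may differ in details (e.g., how resampling windows are chosen), but the two transforms work like this:

```python
import numpy as np

def resample(series, factor):
    """Downsample by averaging non-overlapping windows of `factor` points,
    exposing the same signal at a coarser time resolution."""
    n = len(series) // factor * factor          # drop any ragged tail
    return series[:n].reshape(-1, factor).mean(axis=1)

def value_flip(series):
    """Mirror the series around its mean, turning an up-trend into a
    down-trend twin with identical temporal structure."""
    return 2 * series.mean() - series

t = np.arange(64, dtype=float)
trend = 0.5 * t + np.sin(2 * np.pi * t / 8)     # rising series with a cycle
coarse = resample(trend, 4)                      # 16 points at 1/4 resolution
flipped = value_flip(trend)                      # falling twin, same mean
print(len(coarse))  # 16
```

Training on `trend`, `coarse`, and `flipped` together exposes the model to the same shape at multiple speeds and in both directions.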
New concept 8 — Quantile loss 🍞 Hook: Instead of giving just one guess, you give a range (like “there’s a 90% chance the rain is below this”). 🥬 The Concept: Quantile loss trains the model to predict multiple quantiles (like 10%, 50%, 90%) so it can express uncertainty.
- How it works:
- The head outputs several quantile curves.
- Compare each curve to truth using a pinball-style penalty.
- Combine them so the whole forecast distribution gets better.
- Why it matters: Without uncertainty, planning gets risky; with quantiles, users can balance safety and cost. 🍞 Anchor: A hospital might staff for the 90% quantile of patients on holidays and the 50% quantile on quiet days.
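The "pinball-style penalty" has a short closed form, shown below on made-up numbers: under-prediction is penalized in proportion to q, over-prediction in proportion to (1 − q), so a 90% quantile forecast is pushed to sit above most observations.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss at level q. Minimized when y_pred
    is the q-th quantile of the data."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y = np.array([10.0, 12.0, 9.0, 11.0])
high = np.full(4, 13.0)   # forecast above all observations
low = np.full(4, 8.0)     # forecast below all observations
# At q=0.9 the high forecast is the better 90th-percentile guess.
print(pinball_loss(y, high, 0.9), pinball_loss(y, low, 0.9))  # 0.25 2.25
```

Summing this loss over several levels (10%, 50%, 90%, …) trains the whole forecast distribution at once, which is also what CRPS-style evaluation rewards.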
03 Methodology
At a high level: Past time series → Normalize and pack into patches → Transformer with TimeMoE stacks → Serial TimeSTP blocks refine future steps → Quantile head outputs multi-step forecasts in one pass.
Step 1: Normalization and patching
- What happens: Each univariate series is normalized per instance to reduce scale differences and then chunked into fixed-length patches (like short words in a sentence). A small residual network turns each patch into a token embedding.
- Why this step exists: Different sensors and markets have different scales; normalization lets the model focus on shapes and timing, not absolute magnitudes. Patching shortens sequences and helps the model capture local patterns.
- Example: A temperature series in Celsius and a sales series in dollars are both scaled to comparable ranges; a 16-point patch might cover 16 minutes of readings.
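This step can be sketched in a few lines. The patch length and the toy temperature series below are illustrative, and the real model uses a learned residual network (not shown) to embed each patch, but the normalize-then-chunk mechanics look like this:

```python
import numpy as np

def normalize_and_patch(series, patch_len=16):
    """Per-instance z-normalization followed by non-overlapping patching.
    Returns (patches, mean, std) so forecasts can be de-normalized later."""
    mu, sigma = series.mean(), series.std() + 1e-8   # epsilon avoids div-by-zero
    z = (series - mu) / sigma
    n = len(z) // patch_len * patch_len              # drop any ragged tail
    patches = z[:n].reshape(-1, patch_len)
    return patches, mu, sigma

celsius = 20 + 5 * np.sin(np.linspace(0, 8 * np.pi, 96))
patches, mu, sigma = normalize_and_patch(celsius, patch_len=16)
print(patches.shape)  # (6, 16): 96 readings become 6 patch "tokens"
```

Keeping `mu` and `sigma` is essential: the model forecasts in normalized units, and predictions are mapped back via `pred * sigma + mu`.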
Step 2: Transformer backbone with TimeMoE
- What happens: A decoder-only Transformer stack encodes the context. Each block uses attention and then a sparse Mixture-of-Experts (TimeMoE). Only a couple of experts run per token (for speed), chosen by a router. A mild balancing loss keeps all experts active over time.
- Why this step exists: Time series are diverse; having many specialists that get picked only when needed makes the model both powerful and efficient.
- Example: In a 24-block main stack, a token representing a spiky sensor region might route to an “edge-case” expert and a “short burst” expert, while a smooth token routes to “seasonal” experts.
Step 3: Serial-Token Prediction with TimeSTP blocks
- What happens: After the main stack, a sequence of TimeSTP blocks refines future steps serially. Each TimeSTP block takes two inputs: (a) the previous block’s embeddings and (b) the original input embeddings (a steady anchor). It fuses them, runs a TimeMoE, and outputs a farther-horizon prediction. Deeper blocks handle farther steps. All these blocks are kept for inference.
- Why this step exists: It injects the missing serial computations without rolling the model multiple times. Deeper horizons depend on shallower ones; TimeSTP respects that chain inside one pass.
- Example: If you need 17 future patches, you use the last token from the main stack for the first patch and then the next 16 TimeSTP blocks for patches 2–17, all in one forward pass.
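The serial chain above can be sketched with toy blocks. Everything here is hypothetical (random weights, scalar outputs, a tanh fusion) and stands in for the real TimeSTP blocks, but it shows the key structure: each block takes the previous block's refined state plus the fixed anchor from the main stack, and all horizons come out of one forward pass.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # toy embedding size

def make_block():
    """One hypothetical TimeSTP block: fuse previous state with the
    anchor (original input embedding), refine, emit a prediction."""
    W_fuse = rng.normal(size=(d, 2 * d)) / np.sqrt(2 * d)
    W_out = rng.normal(size=(1, d)) / np.sqrt(d)
    def block(state, anchor):
        fused = np.tanh(W_fuse @ np.concatenate([state, anchor]))
        return fused, float(W_out @ fused)   # refined state, horizon prediction
    return block

blocks = [make_block() for _ in range(4)]    # handle horizons 2..5
anchor = rng.normal(size=d)                  # last-token context from the main stack
state = anchor.copy()                        # horizon 1 comes from the main stack's head
preds = []
for block in blocks:                         # one pass, no rolling
    state, y = block(state, anchor)          # each block reuses the refined state
    preds.append(y)
print(len(preds))  # 4 farther-horizon predictions
```

Contrast with rolling autoregression, which would re-run the entire backbone once per horizon; here the backbone runs once and only the light serial chain walks forward.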
Step 4: Quantile forecasting head
- What happens: A shared head turns each relevant embedding (typically the last-token at each depth) into multiple quantiles for its assigned horizon patch.
- Why this step exists: Many real tasks need probabilities and safety margins, not just single numbers.
- Example: Energy planners might look at the 10th, 50th, and 90th percentiles to decide base load vs. peaker plants.
Step 5: Pre-training on TimeBench
- What happens: Timer-S1 is first trained on TimeBench (over a trillion points) using a dense objective: next-token for short steps and STP for serial multi-steps. Every series contributes many training tasks: different input/output lengths, various horizons.
- Why this step exists: Diverse, dense supervision helps the model learn a general forecasting skill that transfers widely.
- Example: The same weather series yields tasks where the model sees 2 days to predict 1 hour, 1 day to predict 1 day, etc.
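A dense-supervision sampler of this kind is easy to sketch. The specific context/horizon lengths below are made up for illustration (the paper's actual sampling grid is not reproduced here):

```python
import numpy as np

def sample_tasks(series, context_lens, horizon_lens, rng):
    """Hypothetical dense-supervision sampler: one series yields many
    (context, target) training pairs with varied input/output lengths."""
    tasks = []
    for c in context_lens:
        for h in horizon_lens:
            if c + h > len(series):
                continue                      # window doesn't fit this series
            start = rng.integers(0, len(series) - c - h + 1)
            tasks.append((series[start:start + c],
                          series[start + c:start + c + h]))
    return tasks

rng = np.random.default_rng(0)
weather = np.sin(np.linspace(0, 20, 500))     # stand-in for one real series
tasks = sample_tasks(weather, context_lens=[48, 96],
                     horizon_lens=[24, 96], rng=rng)
print(len(tasks))  # 4 tasks carved from a single series
```

Multiplied across a trillion-point corpus, this is how every series contributes many short- and long-horizon lessons.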
Step 6: Post-training for short-term boost and long context
- What happens: Continued pre-training focuses more weight on shorter horizons to sharpen near-term accuracy (which also supports long-term steps). Data is mixed between TimeBench and short-term-focused sets to prevent overfitting. The context window is extended from 2,880 points to 11,520 using positional embedding scaling.
- Why this step exists: Short- and long-term tasks pull in slightly different directions. A dedicated stage lets the model polish short-term skill and handle longer histories.
- Example: Weights for horizons might be strongest at the first step and gradually smaller for farther steps; context 11.5K lets the model see weeks of history if each point is minutes apart.
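The "strongest at the first step, gradually smaller" weighting can be sketched as a geometric decay. The decay rate here is an assumption for illustration, not the paper's actual schedule:

```python
import numpy as np

def horizon_weights(n_steps, decay=0.9):
    """Hypothetical post-training weighting: the first horizon gets the
    largest weight and farther horizons decay geometrically."""
    w = decay ** np.arange(n_steps)
    return w / w.sum()                          # normalize to sum to 1

def weighted_mae(y_true, y_pred, weights):
    return float(np.sum(weights * np.abs(y_true - y_pred)))

w = horizon_weights(8)
y_true = np.zeros(8)
near_miss = np.array([1.0, 0, 0, 0, 0, 0, 0, 0])   # unit error on step 1
far_miss = np.array([0, 0, 0, 0, 0, 0, 0, 1.0])    # same error on step 8
# The same absolute error costs more at the first horizon.
print(weighted_mae(y_true, near_miss, w) > weighted_mae(y_true, far_miss, w))  # True
```

This is what "prioritizes short-term accuracy" means operationally: the loss landscape tilts toward the near steps that later horizons build on.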
Step 7: Efficient training and serving
- What happens: Training uses a distributed framework, mixed-precision compute, and sharded data loading to keep I/O and memory efficient. Inference is fast: multi-step forecasts arrive in a single pass using the TimeSTP chain.
- Why this step exists: Billion-scale models and trillion-point datasets are heavy; careful engineering makes them practical.
- Example: Data is stored in 50MB shards, queued in memory for fast sliding-window sampling.
The Secret Sauce: Serial Scaling
- Serial architecture: Keep the serial reasoning blocks (TimeSTP) for inference so training and testing match.
- Serial data: Train on a huge, cleaned, and augmented corpus that covers many speeds and shapes.
- Serial training: Use a two-stage plan—first learn across all horizons, then sharpen the first steps and expand context—so the model grows strong where it matters most and stays stable over long ranges.
New concept 9 — Putting it all together with an example 🍞 Hook: Think of forecasting store sales for the next 17 hours. 🥬 The Concept: Use the past history (packed into patches), the TimeMoE stack for context, then serial TimeSTP blocks to produce hour 1, then hour 2, … up to hour 17, all at once.
- How it works:
- Normalize past sales; break into patches.
- Run through the TimeMoE backbone to get rich context; its last token yields hour 1.
- Use TimeSTP block 1 for hour 2, block 2 for hour 3, … block 16 for hour 17.
- Quantile head gives uncertainty bands for each hour.
- Why it matters: You avoid slow, error-prone rolling and still respect that hour 3 depends on hours 1–2. 🍞 Anchor: The manager sees a whole day’s forecast with confidence ranges in one shot and can schedule staff accordingly.
04 Experiments & Results
The test: Timer-S1 is evaluated on GIFT-Eval, a large, diverse benchmark (24 datasets, 144,000 series, 177M points). Two scores matter most: MASE (accuracy compared to a naive seasonal baseline) and CRPS (quality of the full probabilistic forecast). This combination tells us both “How close is your point forecast?” and “How trustworthy is your uncertainty range?”
The competition: Strong statistical baselines (like AutoARIMA), recent deep models, and leading time-series foundation models (like Chronos-2, TimesFM-2.5, Toto-1.0, TiRex, and Sundial) are included. These are well-known, tough baselines.
The scoreboard with context:
- Timer-S1 achieves MASE 0.693 and CRPS 0.485, taking the top spot among pre-trained models.
- Interpreting MASE: values below 1.0 beat the naive seasonal baseline, so 0.693 means roughly 30% less error than that baseline; it's like getting an A when most models hover around B territory.
- Interpreting CRPS: 0.485 means the model’s full distributional forecasts (its uncertainty bands) are tighter and more accurate—like shooting arrows that cluster nearer the bullseye, not just one good arrow.
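To make the headline metric concrete, here is MASE on made-up numbers (not GIFT-Eval data): the forecast error is divided by the in-sample error of a naive seasonal forecast, so anything under 1.0 beats "just repeat the last season".

```python
import numpy as np

def mase(y_true, y_pred, history, season=1):
    """MASE: mean absolute forecast error, scaled by the in-sample error
    of the naive forecast that repeats the value `season` steps back."""
    mae = np.mean(np.abs(y_true - y_pred))
    naive_scale = np.mean(np.abs(history[season:] - history[:-season]))
    return mae / naive_scale

history = np.array([10.0, 12, 11, 13, 12, 14])   # past observations
y_true = np.array([13.0, 15.0])                  # what actually happened
good = np.array([13.5, 14.5])                    # a close forecast
rough = np.array([14.0, 13.0])                   # a sloppier forecast
print(mase(y_true, good, history), mase(y_true, rough, history))
```

On this toy example the close forecast scores 0.3125 versus 0.9375 for the sloppy one; Timer-S1's 0.693 is an average of this kind of ratio across the whole benchmark.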
By horizon length (short, medium, long):
- Timer-S1 shines more as horizons get longer. This lines up perfectly with the design goal: inject serial computations that matter increasingly with distance into the future.
- In medium and long ranges, Timer-S1 consistently outperforms top peers, showing fewer drifts and better-calibrated uncertainty.
What changed across training stages:
- Pre-training only (PT): solid but not best.
- Continued pre-training (CPT): noticeable gains in accuracy and uncertainty.
- Long-context extension (LCE): further gains, culminating in the best overall scores. This confirms that multi-stage training is worth it.
Serial vs other training objectives:
- Next-token prediction (NTP): Faithful to serial logic but needs many rolls; at inference time it is slow and accrues errors.
- Multi-token prediction (MTP): Single pass for many steps but weakens serial dependencies; accuracy suffers on longer ranges.
- Serial-Token Prediction (STP): Keeps the serial chain and still runs in one pass; in experiments, STP beats both NTP and MTP on MASE and CRPS.
Speed and efficiency:
- For multi-step outputs, Timer-S1 runs the TimeSTP stack once. NTP must roll repeatedly. MTP predicts at once but can waste compute on unneeded horizons and lacks serial refinement. In measured inference (same backbone), Timer-S1 is faster than repeated NTP and more compute-aligned than MTP.
Scaling analysis:
- Varying TimeMoE depth (the expert backbone) and TimeSTP depth (the serial horizon blocks): results improve up to the chosen configuration (24 TimeMoE + 16 TimeSTP blocks), indicating healthy scaling at billion-parameter size while keeping per-token activation low.
Ablations and insights:
- Keep TimeSTP at inference: If you remove these blocks after training (and switch to rolling), accuracy drops—training and inference must match.
- Avoid using shifted future inputs during training: That creates a train–test gap and hurts performance.
- Data augmentation helps: Resampling and value-flipping reduce bias (e.g., not over-assuming up-trends or a single dominant frequency). On synthetic sinusoids, resampling notably improves robustness across frequencies.
- Pre-training matters: Training from scratch on the post-training data alone underperforms significantly compared to using the TimeBench-pre-trained model.
Surprises:
- A small error spike around the configured patch size on certain sinusoids suggests future improvements by making patching more adaptive.
- Even though the model is trained in a univariate format, it transfers strongly across many datasets—evidence that learning robust temporal patterns pays off widely.
05 Discussion & Limitations
Limitations:
- No native exogenous covariates yet: The model currently focuses on univariate context at pre-training time. Many real tasks benefit from outside signals (promotions, holidays, weather for sales), so adding covariates cleanly remains an open challenge.
- Multivariate structure: While the approach generalizes well, truly leveraging cross-variable interactions at scale is hard due to heterogeneous data and noise.
- Representation tension across horizons: Short-term sharpness vs. long-term stability can pull the backbone in different directions; two-stage training helps but may not fully resolve all trade-offs.
- Patch-size sensitivity: Fixed patch lengths can create tiny blind spots (e.g., certain frequencies). More adaptive patching could help.
- Compute and data needs: Billion-scale models and trillion-point corpora demand serious infrastructure; this won’t fit everyone’s budget.
Required resources:
- GPUs/TPUs capable of distributed training, high-throughput storage (multi-terabyte), and efficient data loaders.
- Mixed-precision training and sharded datasets for practicality.
- For deployment, memory for MoE routing and TimeSTP inference is needed, though per-step activations are kept small (only a subset of experts fire).
When NOT to use:
- Extremely tiny datasets with narrow domains where a small classical model is sufficient and easier to maintain.
- Hard real-time microcontroller settings where even sparse MoE may be too heavy and latency must be ultra-low.
- Tasks dominated by exogenous shocks that the model never sees at training time and that can’t be inferred from the history alone.
Open questions:
- Best way to add exogenous covariates without breaking the serial logic and training stability.
- Adaptive patching or multi-scale tokenization that avoids frequency blind spots.
- More unified training for short vs. long horizons so both excel without multi-stage complexity.
- Calibration and decision-focused training: further improving how quantiles translate into better real-world choices.
- Extending to richer agent systems that reason over text, events, and time together for end-to-end planning.
06 Conclusion & Future Work
Three-sentence summary: Timer-S1 is a billion-scale time-series foundation model that adds serial reasoning into the network itself using Serial-Token Prediction and keeps those serial blocks during inference. Trained on a trillion-point TimeBench corpus with bias-reducing augmentation and finished with a focused post-training stage, it achieves state-of-the-art scores on GIFT-Eval, especially at medium and long horizons. This shows that respecting the serial nature of forecasting while scaling architecture, data, and training in concert breaks the previous bottleneck.
Main achievement: Demonstrating that Serial Scaling—especially TimeSTP blocks and STP—unlocks both accuracy and efficiency for long-range forecasts, outperforming next-token and multi-token strategies while keeping compute practical.
Future directions: Incorporate exogenous covariates robustly, develop adaptive patching/multi-scale tokenization, unify short- and long-horizon representation learning, and integrate the model as a reasoning component in agentic systems handling text, events, and multimodal signals.
Why remember this: It reframes long-horizon forecasting from “roll many times or guess all at once” to “compute the chain inside the model, once,” showing a clear, scalable path to better and faster forecasts that matter in energy, health, logistics, finance, and beyond.
Practical Applications
- Energy demand planning: Predict base load and peaks with quantiles to dispatch generators efficiently.
- Retail workforce and inventory: Staff shifts and stock levels using short- and long-horizon forecasts with uncertainty bands.
- Predictive maintenance for factories and IoT: Anticipate machine failures and sensor drifts to schedule upkeep before breakdowns.
- Healthcare operations: Forecast patient arrivals and bed occupancy to optimize staffing and resources.
- Transportation and traffic: Plan routes, congestion control, and public transit schedules with medium- and long-horizon visibility.
- Finance and risk: Anticipate volatility ranges for asset management, hedging, and liquidity planning.
- Cloud and IT capacity: Forecast server load and network traffic to right-size infrastructure and autoscaling policies.
- Climate and weather-adjacent signals: Plan for temperature swings or demand spikes related to weather-sensitive industries.
- Supply chain and logistics: Coordinate shipping, warehousing, and last-mile delivery across weeks with fewer stockouts.
- A/B testing and growth analytics: Forecast metric trajectories under uncertainty to decide experiment duration and risk.