veScale-FSDP: Flexible and High-Performance FSDP at Scale
Key Summary
- This paper makes training giant AI models faster and lighter on memory by inventing a new way to split tensors called RaggedShard.
- RaggedShard lets tensors be cut into neat blocks (like 32×32) so block-wise quantization and matrix-style optimizers work naturally without hacks.
- A planning algorithm smartly packs these blocks into communication buffers to avoid wasteful padding and slow copies.
- A new Distributed Buffer (DBuffer) gives zero-copy communication and batched memory management, which reduces fragmentation and speeds things up.
- Compared to popular systems (DeepSpeed ZeRO, PyTorch FSDP1/2, Megatron-FSDP), veScale-FSDP is 5–66% faster and uses 16–30% less memory.
- It scales cleanly to tens of thousands of GPUs and even trains 2.4T-parameter sparse models on 1K GPUs with strong efficiency.
- It natively supports structure-aware training like 8-bit Adam and non-element-wise optimizers like Muon, with simple, PyTorch-native code.
- The planner’s NP-hard problem is handled with practical heuristics that run under 0.3 seconds, keeping startup quick.
- Small-scale profiling accurately predicts large-scale behavior, making capacity planning and cost estimation easier.
- The approach composes with other parallelisms (like expert parallel) and standard DTensor checkpointing for robust, practical deployment.
Why This Research Matters
Training bigger, smarter models often hits two walls: speed and memory. By matching shards to the model’s natural blocks, veScale-FSDP lets modern tricks like block-wise 8-bit optimizers and Muon run easily and efficiently. Faster throughput and lower memory mean the same job finishes sooner or needs fewer GPUs—cutting costs and saving energy. The method scales cleanly to very large GPU counts, so teams can plan big runs with confidence. Because it’s PyTorch-native and composes with existing parallelism and checkpointing, adoption is straightforward. This lowers the barrier to state-of-the-art training, speeding up research and production alike.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you and your friends are building a giant Lego castle. If each friend grabs random bricks from a messy pile, you waste time searching and fixing pieces that don’t fit. But if each friend gets tidy boxes of exactly the right brick shapes, the castle rises fast.
🥬 The Concept (Fully Sharded Data Parallel, FSDP): FSDP is a way to split a huge neural network across many GPUs so training fits in memory and runs fast. How it works: 1) Chop model tensors (weights, gradients, optimizer states) into pieces, 2) Give each GPU its shard, 3) Only bring full tensors together when you need to compute, 4) Share gradients back by scattering. Why it matters: Without sharding, big models don’t fit on one GPU and training stalls or crashes.
🍞 Anchor: Training a 70B-parameter model? FSDP lets 1,024 GPUs each hold a slice so the full model lives across the cluster like a giant Lego castle built by many hands.
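The shard-then-gather cycle described above can be sketched in plain Python. This is a toy single-process simulation (not the veScale-FSDP or PyTorch API); the helper names `shard` and `all_gather` are hypothetical, and real systems move bytes with NCCL collectives rather than list copies.

```python
# Toy simulation of the FSDP cycle: shard a flat parameter across ranks,
# then all-gather the full parameter when every rank needs it for compute.

def shard(flat_param, world_size):
    """Split a flat list of values into near-even contiguous shards."""
    base, rem = divmod(len(flat_param), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)  # first `rem` ranks get one extra
        shards.append(flat_param[start:start + size])
        start += size
    return shards

def all_gather(shards):
    """Every rank reassembles the full parameter from all shards."""
    full = [v for s in shards for v in s]
    return [list(full) for _ in range(len(shards))]  # one full copy per rank

if __name__ == "__main__":
    param = list(range(10))          # a toy flat parameter with 10 elements
    shards = shard(param, 4)         # 4 "GPUs": shard sizes 3, 3, 2, 2
    gathered = all_gather(shards)
    assert all(g == param for g in gathered)
```

Each rank holds only its shard between steps; the full parameter exists only transiently, which is where FSDP's memory savings come from.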
🍞 Hook: You know how some recipes say “mix these 2 cups as a group” instead of “mix 1 spoon at a time”? That’s because the structure matters.
🥬 The Concept (Structure-aware training): Structure-aware training pays attention to how tensors are shaped (matrices and blocks), not just single numbers. How it works: 1) Keep tensors in their natural 2D blocks, 2) Do math on whole blocks (like matrix updates or block-wise quantization), 3) Align data placement with those blocks. Why it matters: If you split tensors randomly, block math breaks or needs slow fixes.
🍞 Anchor: If a model wants blocks for quantization, but shards cut through the middle of those blocks, you must copy and fix pieces every step—slow and messy.
🍞 Hook: Picture an orchestra: you don’t just tune one violin string—you adjust whole sections together.
🥬 The Concept (Non-element-wise optimizers): Optimizers like Shampoo or Muon update weights using whole matrices, not one element at a time. How it works: 1) Collect a full matrix on a chosen GPU, 2) Do matrix math (like Newton–Schulz), 3) Send updates back to the right shards. Why it matters: If shards slice matrices in awkward ways, these optimizers can’t run efficiently.
🍞 Anchor: Muon needs the entire 2D weight to do its preconditioning step; if the matrix is scattered into tiny mismatched bits, you can’t tune the “orchestra section.”
🍞 Hook: Think of compressing a big book into neat chapter summaries instead of cutting words at random.
🥬 The Concept (Block-wise quantization): It shrinks tensors by compressing them block by block (for example, in 32×32 blocks), keeping accuracy while saving memory. How it works: 1) Split weights or optimizer states into fixed-size blocks, 2) Compute scales per block, 3) Store in fewer bits (e.g., 8-bit), 4) Dequantize when needed. Why it matters: If shard boundaries don’t match block boundaries, you need extra communication and copies.
🍞 Anchor: 8-bit Adam stores statistics in INT8 per block; aligned blocks mean no cross-GPU coordination.
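The per-block scale idea can be shown in a few lines. This is an illustrative sketch, not the bitsandbytes or veScale implementation: the helper names are hypothetical, the block size is tiny for readability, and real 8-bit Adam uses dynamic quantization rather than the plain absmax scheme here.

```python
# Toy block-wise 8-bit quantization: each block stores one float scale plus
# int8 codes in [-127, 127]. Blocks that align with shards quantize locally,
# with no cross-GPU coordination.

BLOCK = 4  # toy block size; real systems use e.g. a 32x32 tile

def quantize_blockwise(values, block=BLOCK):
    """Return a list of (scale, int8 codes) pairs, one per block."""
    blocks = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        scale = max(abs(v) for v in chunk) or 1.0       # absmax scale per block
        codes = [round(v / scale * 127) for v in chunk]  # map to [-127, 127]
        blocks.append((scale, codes))
    return blocks

def dequantize_blockwise(blocks):
    return [c * scale / 127 for scale, codes in blocks for c in codes]

if __name__ == "__main__":
    x = [0.5, -1.0, 0.25, 0.75, 10.0, -20.0, 5.0, 15.0]
    restored = dequantize_blockwise(quantize_blockwise(x))
    # Per-block scales keep the small first block accurate even though the
    # second block contains much larger magnitudes.
    assert max(abs(a - b) for a, b in zip(x, restored)) < 0.1
```

The point of the per-block scale is exactly the alignment argument above: if a block lived on two GPUs, computing its absmax would itself require communication.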
The world before: FSDP/ZeRO made big models trainable, but mainly used element-wise or evenly row-wise sharding. That was fine for simple, element-wise optimizers. The problem: modern structure-aware training (block-wise quantization, Muon/Shampoo) needs clean 2D blocks or whole matrices. Old sharding splits right through those blocks, forcing extra padding, boundary checks, slow copies, and more communication. Failed attempts: 1) DeepSpeed ZeRO: fragmented all-gathers and memory management issues; 2) PyTorch FSDP1: faster but still element-wise limits; 3) PyTorch FSDP2: per-parameter even sharding with DTensor helps flexibility, but causes interleaved copy-in/out and alignment problems; 4) Megatron-FSDP: speeds up by concatenating, but adds padding to fake even shards, growing memory and communication—and still not block-aware. The gap: a way to shard tensors at arbitrary, user-picked block sizes, while also packing them for peak communication speed and low memory. Real stakes: Better throughput and lower memory mean training larger, smarter models faster, on fewer GPUs, saving time, money, and energy—and enabling cutting-edge techniques that improve quality.
02 Core Idea
🍞 Hook: You know how packing a suitcase is easiest when your clothes are folded into tidy cubes that exactly fit the shelves? If your shelves match your cube sizes, everything slides in perfectly with no wasted space.
🥬 The Concept (The “Aha!”): Make sharding match the model’s natural blocks, then plan and pack those blocks so communication is fast and memory is tidy. How it works: 1) Introduce RaggedShard to express any block size and uneven splits per device, 2) Plan the layout so blocks land contiguously and evenly across GPUs with minimal padding, 3) Store and move data with a zero-copy Distributed Buffer (DBuffer). Why it matters: Without this trio, structure-aware training either breaks or pays big penalties in copies, padding, and fragmentation.
🍞 Anchor: Instead of slicing pizza randomly, we cut perfect squares that match the box grid; then we stack boxes so delivery trucks (collectives) run full and fast.
Three analogies:
- Lego bins: RaggedShard groups bricks by exact shape; the planner arranges bins so each table gets the right bricks; DBuffer is the rolling cart that moves bins without repacking.
- Library shelves: RaggedShard makes every book block-fit; the planner orders shelves to avoid gaps; DBuffer is a bookcase on wheels so you don’t re-shelve to move.
- Orchestra seating: RaggedShard keeps instrument sections intact; the planner balances sections across the stage; DBuffer is the riser that rolls sections together without rewiring.
Before vs After:
- Before: Even or element-wise shards cut through 2D blocks, causing misalignment, padding growth, and extra copies; non-element-wise optimizers needed special, intrusive logic.
- After: Blocks are first-class citizens. 8-bit Adam aligns perfectly with blocks; Muon can gather a full matrix to one rank naturally; communication uses big, contiguous, aligned buffers with zero-copy.
Why it works (intuition, not equations):
- Matching boundaries: When shard edges line up with block edges, per-block math stays local—no cross-GPU stitching.
- Balanced packing: If each GPU gets nearly the same total bytes and aligned chunks, ring-based collectives keep the network saturated.
- Zero-copy paths: If what you communicate is already laid out like what you compute, you skip costly copy-in/out.
- Batched memory: Fewer, larger allocations reduce fragmentation and peak reserved memory.
Building blocks (with mini “sandwich” explanations):
- 🍞 Hook: Imagine cutting brownies exactly on the grid lines. 🥬 RaggedShard: A DTensor placement that shards tensors in user-chosen atomic blocks (rows or 2D tiles) with uneven amounts per device; steps: pick block size, assign blocks to devices, keep them contiguous; why: non-element-wise ops and quantization then “just work.” 🍞 Anchor: quantization blocks stay intact on each GPU—no cross-talk.
- 🍞 Hook: Think Tetris—fit shapes to fill rows perfectly. 🥬 Planning algorithm: It permutes and places tensors into a global buffer so block edges align with shard edges, loads are balanced, and padding is tiny; steps: try smart orders, compute the minimal per-GPU buffer size S that keeps each tensor contiguous and every block unsplit; why: misfit shapes cause wasted space and slowdowns. 🍞 Anchor: For MoE weights sharded at their natural block size, padding stays under a few percent across many scales.
- 🍞 Hook: A toolbox drawer that fits your tools so you never repack them. 🥬 DBuffer: A multi-dimensional global buffer with group-level ops and zero-copy address mapping; steps: map tensor slices to slots, run fused kernels, communicate in place; why: avoids interleaved copies and reduces fragmentation. 🍞 Anchor: AllGather runs on pre-packed bytes, then compute picks them up in place.
In short, the key idea is to respect structure (RaggedShard), pack it smartly (planner), and move it efficiently (DBuffer).
03 Methodology
At a high level: Model tensors and optimizer states → (A) Choose RaggedShard granularity → (B) Plan grouped communication layout → (C) Map to DBuffer (zero-copy slots) → (D) Train: AllGather/ReduceScatter with in-place, fused ops → (E) Structure-aware steps like 8-bit Adam or Muon run naturally.
Step A: Choose sharding granularity with RaggedShard
- What happens: For each parameter, pick an atomic block (e.g., 1 row, 16 rows, or a 2D tile). Assign an uneven number of these blocks to each GPU if needed (ragged amounts). Compose RaggedShard with other DTensor placements like Shard(0) for expert or tensor parallelism.
- Why this exists: Non-element-wise ops and block quantization need clean block boundaries. If we used only even, row-wise sharding, blocks could be cut in half.
- Example: A weight is cut into fixed-size 2D tiles. GPU0 gets 8 tiles, GPU1 gets 9, etc., based on load.
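Step A's "ragged amounts" can be sketched as follows. This is an illustrative helper, not the RaggedShard DTensor placement itself; the function name is hypothetical and a real placement also records which tiles land on which rank.

```python
# Toy version of ragged block assignment: distribute whole atomic blocks to
# ranks as evenly as possible. Counts may differ by one ("ragged"), but no
# block is ever cut across two devices.

def ragged_block_assignment(num_blocks, world_size):
    """Return the number of whole blocks each rank holds."""
    base, rem = divmod(num_blocks, world_size)
    return [base + (1 if rank < rem else 0) for rank in range(world_size)]

if __name__ == "__main__":
    # A toy weight of 17 atomic tiles over 2 GPUs: one rank holds 9, one 8.
    counts = ragged_block_assignment(17, 2)
    assert sum(counts) == 17 and max(counts) - min(counts) <= 1
```

Contrast this with even element-wise sharding, which would give each rank exactly half the elements and slice the 17th tile in two.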
Step B: Plan grouped communication layout (the NP-hard part solved with heuristics)
- What happens: The planner orders tensors (e.g., default, by block size, or by shape), then computes the smallest per-GPU buffer size S such that: 1) each tensor is contiguous in memory, 2) shard edges never cut through a block, 3) per-GPU loads are balanced, and 4) buffers align to collective-friendly sizes.
- Why this exists: Naive concatenation causes sharded blocks, non-contiguous tensors, and uneven loads—slowing NCCL collectives and forcing copy-in/out.
- Example: Three tensors with block granularities of 32 rows, 16 rows, and 2D tiles. The planner picks an order and S so every boundary is on a block edge; minimal padding lands only between tensors, not inside them.
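A drastically simplified version of Step B's sizing logic, assuming every tensor is measured in bytes and has one block granularity, can look like this. It is a sketch, not the paper's NP-hard formulation or heuristic: choosing S as a multiple of every block size (and a collective-friendly alignment) guarantees rank boundaries fall on block edges, at the cost of more padding than the real planner accepts.

```python
# Toy planner: lay tensors out contiguously with padding only between them,
# then pick a per-GPU shard size S aligned to every block size, so that every
# rank boundary (a multiple of S) is also a block edge.

from math import ceil, gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def plan_layout(tensor_bytes, block_bytes, world_size, comm_align=16):
    """Return (per-GPU size S, [(offset, size), ...]) with block-aligned starts."""
    offsets, cursor = [], 0
    for size, blk in zip(tensor_bytes, block_bytes):
        cursor = ceil(cursor / blk) * blk      # pad only *between* tensors
        offsets.append((cursor, size))
        cursor += size
    unit = comm_align                          # S must divide evenly by every
    for blk in block_bytes:                    # block size and the alignment
        unit = lcm(unit, blk)
    s = ceil(cursor / world_size / unit) * unit
    return s, offsets

if __name__ == "__main__":
    s, offs = plan_layout([96, 40, 64], [32, 8, 16], world_size=4)
    assert s % 32 == 0 and s % 16 == 0 and s % 8 == 0
    # Every tensor starts on its own block edge.
    assert all(off % blk for (off, _), blk in zip(offs, [32, 8, 16])) is False
```

Because each tensor starts on a multiple of its block size and S is a multiple of every block size, any rank boundary is a whole number of blocks away from every tensor's start, so no block straddles two ranks.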
Step C: Map planned layout to DBuffer
- What happens: The planned byte ranges for each tensor are bound to slots in a multi-dimensional DBuffer over the device mesh. DBuffer stores a persistent address mapping so collectives read/write directly.
- Why this exists: To skip copies before/after communication and to fuse per-tensor kernels (add/scale/zero) into group-level ops that don’t stall NCCL.
- Example: Before AllGather, DBuffer fuses tiny ‘scale’ kernels for 100 tensors into a single launch; after AllGather, tensors are already in place for compute.
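The "persistent address mapping" of Step C can be mimicked with a `memoryview` over one flat allocation. This `ToyDBuffer` class is hypothetical (the real DBuffer is multi-dimensional, device-resident, and integrated with NCCL), but it shows the zero-copy contract: slots are views, not copies.

```python
# Toy DBuffer: one batched allocation backs many tensor slots, and memoryview
# slices hand out zero-copy views, so "communication" and "compute" touch the
# same bytes with no per-tensor copy-in/copy-out.

class ToyDBuffer:
    def __init__(self, nbytes):
        self.storage = bytearray(nbytes)   # one large allocation, not many small
        self.slots = {}                    # name -> (offset, length)

    def bind(self, name, offset, length):
        """Persistently map a tensor name to a byte range (from the planner)."""
        self.slots[name] = (offset, length)

    def view(self, name):
        off, length = self.slots[name]
        return memoryview(self.storage)[off:off + length]  # zero-copy slice

if __name__ == "__main__":
    buf = ToyDBuffer(16)
    buf.bind("w1", 0, 8)
    buf.bind("w2", 8, 8)
    buf.view("w1")[:] = b"AAAAAAAA"        # a collective "writes" in place...
    assert bytes(buf.storage[:8]) == b"AAAAAAAA"  # ...and compute sees it
```

Fewer, larger allocations are also what reduces fragmentation and peak reserved memory in the real system.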
Step D: Training loop with in-place, zero-copy collectives
- What happens: For each FSDP unit (e.g., a layer): 1) AllGather parameters from shards to full tensors (in DBuffer layout), 2) Forward compute, 3) Backward compute, 4) ReduceScatter gradients back to shards, 5) Optimizer updates, possibly structured.
- Why this exists: Keep communication overlapped with compute, and ensure the communication buffers exactly match compute needs to avoid interleaved copies.
- Example: FSDP2 pays 10–20 ms per step in copy-in/out; here those disappear because the DBuffer memory is already compute-ready.
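One iteration of Step D can be simulated in a single process. This is a toy illustration (hypothetical helpers, lists instead of GPU buffers, a fake "backward"), not the real overlapped loop: all-gather produces the full parameter, each rank computes a gradient over it, and reduce-scatter leaves each rank with only its shard of the summed gradient.

```python
# Toy single-process simulation of one FSDP unit's step:
# all-gather params -> compute per-rank grads -> reduce-scatter grad shards.

def all_gather(shards):
    return [v for s in shards for v in s]          # full parameter everywhere

def reduce_scatter(per_rank_grads, shard_sizes):
    n = len(per_rank_grads[0])
    summed = [sum(g[i] for g in per_rank_grads) for i in range(n)]  # reduce
    out, start = [], 0
    for size in shard_sizes:                       # scatter shards back
        out.append(summed[start:start + size])
        start += size
    return out

if __name__ == "__main__":
    shards = [[1.0, 2.0], [3.0, 4.0]]              # 2 ranks, 2 elements each
    full = all_gather(shards)                      # forward needs full param
    grads = [[v * 0.25 for v in full],             # fake backward on rank 0
             [v * 0.5 for v in full]]              # fake backward on rank 1
    grad_shards = reduce_scatter(grads, [2, 2])
    assert grad_shards == [[0.75, 1.5], [2.25, 3.0]]   # (0.25 + 0.5) * param
```

In veScale-FSDP both collectives read and write DBuffer slots directly, which is what removes the copy-in/copy-out tax mentioned above.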
Step E: Structure-aware optimizers and quantization
- 8-bit Adam (block-wise quantization)
- What happens: Per-parameter policy sets blocks; each GPU quantizes its local blocks independently (no extra collectives) because blocks align with shards.
- Why: Misaligned blocks would require gathering metadata or states across GPUs.
- Example: A moment block gets scaled and compressed to INT8 locally, saving memory while keeping accuracy.
- Muon (matrix preconditioning)
- What happens: Use DTensor redistribute with RaggedShard to unshard a full 2D matrix onto a chosen root GPU, run Newton–Schulz there, then scatter updates back.
- Why: Ensures full-matrix math occurs where the whole matrix resides—no awkward cross-GPU stitching.
- Example: One weight matrix per step is gathered to a root rank chosen for load balance, updated, then sent back to shards.
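The Newton–Schulz step Muon runs on the root rank can be sketched for a tiny matrix. This is toy pure-Python math, not the Muon optimizer: the iteration X ← 1.5·X − 0.5·X·Xᵀ·X drives the gathered matrix toward an orthogonal one, which is the preconditioning that requires the whole 2D matrix on one rank. (Muon itself uses a tuned quintic polynomial and runs on GPU tensors.)

```python
# Toy Newton-Schulz orthogonalization, the full-matrix math behind a
# Muon-style step. Requires the *whole* matrix, hence gather-to-root.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz(g, steps=20):
    # Scale so the spectral norm is < sqrt(3), a known convergence condition;
    # the Frobenius norm is a convenient upper bound on the spectral norm.
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(x, matmul(transpose(x), x))
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

if __name__ == "__main__":
    x = newton_schulz([[3.0, 1.0], [1.0, 2.0]])
    xtx = matmul(transpose(x), x)
    # X^T X is close to the identity: X is (near-)orthogonal.
    assert abs(xtx[0][0] - 1) < 1e-6 and abs(xtx[1][1] - 1) < 1e-6
    assert abs(xtx[0][1]) < 1e-6
```

If the matrix were scattered in mismatched pieces, every `matmul` here would become a cross-GPU operation; gathering it once via RaggedShard redistribute keeps the whole loop local.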
Secret sauce (why the trio is clever together):
- RaggedShard makes blocks first-class citizens, so structure-aware math is natural.
- The planner keeps buffers contiguous, aligned, and balanced, which saturates the network with minimal padding.
- DBuffer turns planned layouts into reality with zero-copy communication and fused kernels, cutting stalls and fragmentation.
- Together, they transform the communication+memory backbone so that model math flows without detours.
04 Experiments & Results
The test: Measure end-to-end training speed (tokens/sec), peak GPU memory, and scalability; also check if structure-aware methods (8-bit Adam, Muon) are easy and efficient. Why these metrics: Throughput = how fast you learn; memory = how big a model you can fit; scaling = how well more GPUs help; structure-aware success = can we use modern tricks without hacks.
Competition: DeepSpeed ZeRO, PyTorch FSDP1, PyTorch FSDP2, and Megatron-FSDP. All use comparable settings (ZeRO-3, mixed precision).
Scoreboard with context:
- Overall speed: veScale-FSDP is 5–66% faster. Think of a class where most kids get a B; veScale-FSDP gets an A and often an A+.
- Memory: 16–30% lower peak reserved memory across benchmarks. That’s like fitting the same luggage into a smaller suitcase, with room to spare.
- On LLaMA-3-70B: about 5% faster than DeepSpeed/FSDP1/FSDP2 and slightly faster than Megatron-FSDP—this is significant at that scale because a few percent saves a lot of GPU-hours.
- On MoE models: 11–66% faster due to less padding, better overlap, and zero-copy collectives. Megatron’s fixed padding inflated buffers by ~33% in these cases, slowing collectives and raising memory.
- Scaling: Near-linear weak scaling from 1K to 8K GPUs for 800B-parameter MoE; strong scaling remains linear up to 10K GPUs at large global batches. Trains up to 2.4T parameters on 1K GPUs with high MFU, thanks to DBuffer memory efficiency.
- Structure-aware success: 8-bit Adam runs locally per shard (aligned blocks) with matching convergence to DDP’s baseline behavior; Muon cleanly gathers full matrices to a root rank, converging faster than AdamW in tests.
Surprising findings and gotchas:
- FSDP2 sometimes OOMs when group sizes force even splits plus padding (e.g., GPT-OSS at 256 GPUs), effectively doubling buffers.
- Interleaved copy-in/out in per-parameter DTensor designs can eat up to ~14% of an iteration—veScale-FSDP’s zero-copy path avoids this tax.
- Small-scale profiling predicted large-scale performance well; because FSDP’s compute stays local, communication costs don’t balloon unpredictably.
- Planner runtime is under 0.3s—a negligible, one-time startup cost.
Bottom line: Cleaner block alignment + smart packing + zero-copy buffers = meaningfully faster, leaner training—especially for MoE and structure-aware methods.
05 Discussion & Limitations
Limitations:
- Padding spikes can still occur at specific shard counts when collective alignment and block granularities clash (LCM effects), especially with large tilings.
- The planning problem is NP-hard; while the heuristic is fast and near-optimal in practice, it’s not a formal global optimum.
- Peak gains depend on high-quality collective libraries and alignment (e.g., NCCL) and may vary on different interconnects.
- Very irregular models with many unique block sizes could reduce packing efficiency.
Required resources:
- Multi-GPU nodes with a fast interconnect, a PyTorch version with DTensor, and NCCL or equivalent collective support.
- Enough host and device memory to hold planned DBuffer slots and fused kernels.
When not to use:
- Small models on a handful of GPUs where plain DDP already saturates compute—overhead savings may be negligible.
- Workloads that never use structure-aware methods and already have well-tuned FSDP; the extra flexibility won’t shine.
Open questions:
- Automatic block-size selection: can the system pick per-parameter granularities from profiles to minimize padding and maximize locality?
- Dynamic re-planning: can we adjust layouts on the fly as batch sizes, sequence lengths, or expert routing patterns change?
- Generalization: how does this extend to other modalities (vision/audio) with 3D/4D tilings and different operator mixes?
- Cross-vendor support: how robust is performance on alternative interconnects or collective libraries?
- Fault tolerance: can we integrate planner decisions with checkpointing to recover mid-iteration without extra overhead?
06 Conclusion & Future Work
Three-sentence summary: veScale-FSDP makes structure-aware training first-class by introducing RaggedShard (block-aligned sharding), a planning algorithm (smart packing with minimal padding), and DBuffer (zero-copy group communication). Together, they boost throughput by 5–66% and cut memory by 16–30% versus leading FSDP systems, while scaling cleanly to tens of thousands of GPUs and multi-trillion-parameter models. They also make 8-bit Adam and Muon easy to deploy with PyTorch-native code.
Main achievement: Unifying flexibility (arbitrary block granularity) and performance (aligned, balanced, zero-copy communication) in a single, practical FSDP system.
Future directions: Automate block-size choices, add dynamic re-planning as workloads shift, broaden support for diverse modalities and interconnects, and deepen integration with distributed checkpointing and compilation to push MFU even higher.
Why remember this: It shows that respecting tensor structure isn’t just elegant—it’s faster and cheaper at scale. By aligning shards to blocks, planning for balance, and removing copies, we unlock the newest optimizers and quantization methods without hacks, making giant-model training both simpler and stronger.
Practical Applications
- Train LLMs with block-wise 8-bit Adam to reduce optimizer memory while preserving accuracy.
- Use Muon or Shampoo efficiently by gathering full matrices on chosen ranks without custom collectives.
- Scale MoE models to thousands of GPUs with minimal padding and balanced communication.
- Cut cloud costs by improving throughput (5–66%) and reducing peak memory (16–30%).
- Avoid OOMs at larger FSDP sizes by eliminating interleaved copies and padding blow-ups.
- Compose with expert/tensor parallelism using DTensor placements for complex, multi-dimensional sharding.
- Speed up training initialization with a planner that finishes in under a second for typical jobs.
- Leverage zero-copy collectives to remove 10–20 ms per-iteration copy overhead common in per-parameter designs.
- Use standard PyTorch checkpointing with sharded saves and no extra glue code.
- Profile on small GPU counts to predict large-scale performance and plan capacity reliably.