ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation
Key Summary
- • ArcFlow is a new way to make text-to-image models draw great pictures in only 2 steps instead of 50, giving about a 40× speed boost.
- • The trick is to follow a smooth curvy path (non-linear flow) from noise to picture, instead of a straight shortcut that misses the teacher model's twists and turns.
- • ArcFlow models how the drawing speed (velocity) changes over time using "momentum," like a rolling ball that keeps moving and gradually changes pace.
- • It blends several momentum "modes" together, then uses an exact math formula to jump across time without shaky approximations.
- • Because its path matches the teacher's curved path better, ArcFlow keeps image quality and variety high, even with very few steps.
- • It only fine-tunes small add-on pieces (LoRA adapters), training less than 5% of the original parameters for fast and stable learning.
- • On big models like Qwen-Image-20B and FLUX.1-dev, ArcFlow beats other 2-step methods on quality scores (lower FID/pFID) and keeps prompts well-aligned.
- • Training converges faster and stays steadier than past methods that try to force straight-line paths or use unstable adversarial tricks.
- • A known limit: pushing down to 1 step (1 NFE) still makes images blurry; 2 steps are the sweet spot for now.
Why This Research Matters
Fast, faithful image generation makes creative tools feel instant: artists can iterate designs, styles, and layouts without waiting. Lower compute per image reduces cloud costs and energy use, helping more people access powerful visual AI. Games, AR/VR, and interactive storytelling benefit from rapid scene synthesis without losing detail or variety. Education and science apps can visualize ideas quickly while preserving realism and nuance. Because ArcFlow aligns closely with the teacher's distribution, it maintains trustworthy textures and structure, not just prompt relevance. With careful policy and moderation, this efficiency boost expands what's practical while keeping quality high.
Detailed Explanation
01 Background & Problem Definition
You know how baking a cake takes time because you must do many little steps: mix, pour, bake, cool? Early text-to-image AIs worked the same way. They started with random noise and changed it bit by bit across 40-100 mini-steps to form a clear picture. That made amazing images, but it was slow, too slow for real-time apps.
🍞 Hook: Imagine drawing a picture by tracing over it 50 times, each pass making it a little cleaner. Great result, but wow, that's a lot of tracing. 🥬 The Concept (Diffusion models): They are AI artists that start with noise and gradually "denoise" it into a picture.
- What it is: A method that turns noise into images through many small refinements.
- How it works: (1) Start with noise. (2) Predict how to nudge the picture a tiny bit cleaner. (3) Repeat 40-100 times. (4) You get a final image.
- Why it matters: Without many steps, the picture stays fuzzy; with too many steps, it's too slow. 🍞 Anchor: Stable Diffusion and other modern image models use this multi-step cleanup to make crisp art.
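The many-small-refinements loop can be sketched in a few lines. This is a toy illustration only: `predict_velocity` is a hypothetical stand-in for a real network, and the "latent" is just a small array.

```python
import numpy as np

def generate(predict_velocity, shape=(4, 4), steps=50, seed=0):
    """Start from pure noise and apply many small refinement steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)            # t = 1: pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt                      # time runs from 1 down to 0
        x = x - dt * predict_velocity(x, t)   # one tiny nudge toward the image
    return x                                  # t = 0: the final "image"

# Toy stand-in for the network: velocity points along the latent,
# so each step shrinks it (a crude "denoising" effect).
image = generate(lambda x, t: x)
```

With 50 iterations each nudge is small; the few-step methods below try to collapse this whole loop into 2 updates.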
The big problem: Fast versions tried to squeeze those 50 steps into just 2-4 steps. Many methods "shortcut" the path the teacher model takes by drawing straight lines between far-apart times. But the teacher's true path bends and curves; its direction keeps changing. A straight line misses those curves.
🍞 Hook: You know how a mountain road winds left and right? If your map draws a straight line from top to bottom, you'd fall off a cliff. 🥬 The Concept (Linear shortcut vs. curved path): A straight shortcut tries to replace a twisty road.
- What it is: Approximating a curvy path with a straight line across big time jumps.
- How it works: (1) Pick a start and end. (2) Move in a straight line. (3) Hope it stays close to the real road.
- Why it matters: If the road actually curves a lot, you drift far away and end up off target: blurry or wrong images. 🍞 Anchor: Prior few-step students lost detail and diversity because they couldn't follow the teacher's curves.
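A two-line toy calculation shows how badly a single straight jump drifts on a curved path. The exponential-decay dynamics here are purely illustrative, not the teacher's actual flow:

```python
import math

# Toy curved dynamics: dx/dt = -x (the true path bends exponentially).
x0, big_step = 1.0, 2.0

# Straight-line shortcut: one Euler step using only the starting slope.
x_straight = x0 + big_step * (-x0)     # lands at -1.0, overshooting past 0

# Following the curve exactly:
x_exact = x0 * math.exp(-big_step)     # lands near 0.135

error = abs(x_straight - x_exact)      # the shortcut is wildly off
```

The same slope that is fine for a tiny step sends a big step far off course, which is exactly the failure mode of linear shortcuts.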
People tried many things:
- Progressive distillation and rectified flow tried to slowly straighten the path. Still had errors when only a few big steps were allowed.
- Consistency models mapped points directly, but needed heavy math tricks to stay stable.
- Adversarial (GAN-like) losses improved sharpness, but training got wobbly and sometimes collapsed to repeating looks.
- Gaussian mixture approaches approximated how speed changes, but weren't exact enough in extreme 2-step cases.
🍞 Hook: Imagine learning to dance by watching a teacher. If you only copy their start and end poses with a straight move, you miss the flow of the moves in between. 🥬 The Concept (Teacher-student distillation): A smaller student learns to move like a skilled teacher.
- What it is: Training a fast student to imitate a slow-but-expert teacher.
- How it works: (1) Watch the teacher's motion. (2) Practice matching their direction and timing. (3) Do it in fewer steps.
- Why it matters: Good imitation keeps quality; bad imitation looks stiff or off-beat. 🍞 Anchor: A 2-step student that truly follows the teacher's in-between moves keeps style, detail, and variety.
The missing piece, the gap, was respecting how the teacher's direction changes over time within each big step. We needed a way to model this change smoothly, not as one flat shove.
Enter ArcFlow. It uses a physics idea: momentum.
🍞 Hook: Picture a rolling ball that doesn't stop instantly. Its speed and direction change smoothly, not in sudden jumps. 🥬 The Concept (Momentum parameterization): Model how the "drawing speed" evolves over time.
- What it is: A way to say, "If I was moving this fast at time A, I can predict my speed at time B smoothly."
- How it works: (1) Start from a base velocity. (2) Apply a momentum factor that smoothly scales it over time. (3) Combine multiple such modes for complex motion.
- Why it matters: This captures the teacher's curves, so the student doesn't get lost. 🍞 Anchor: Instead of one straight push, ArcFlow predicts an arc: the speed and direction bend naturally across the step.
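As a sketch, momentum-style velocity evolution can be written with an exponential scaling factor. This particular functional form is an assumption for illustration, not the paper's exact parameterization:

```python
import math

def velocity_at(v_start, gamma, elapsed):
    """Velocity after `elapsed` time, given the velocity at the start of the
    step and a momentum factor gamma (exponential form assumed here)."""
    return v_start * math.exp(gamma * elapsed)

# Moving at speed 2.0; a decaying momentum (gamma < 0) slows it smoothly.
later = velocity_at(2.0, -1.5, 0.3)    # speed eases down, no abrupt jump
steady = velocity_at(2.0, 0.0, 0.3)    # gamma = 0: speed never changes
```

The key property is smoothness: knowing the velocity and its momentum factor at one instant determines it at nearby instants.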
Finally, real stakes: Faster, high-quality image generation makes creative tools snappier, games and AR more responsive, and servers cheaper to run. But we also need to be careful about misuse (e.g., fake images), so safety policies still matter.
02 Core Idea
The "Aha!" in one sentence: If the teacher's path is curvy, don't force a straight line; learn the curve itself by modeling how velocity evolves with momentum, then integrate it exactly in one go.
Three analogies (same idea, different lenses):
- Driving analogy: Past methods tried to jump across a winding road with a ruler-straight shortcut; ArcFlow reads the road's curve signs (momentum) and steers smoothly through the bends.
- Music analogy: The teacher's melody changes tempo and volume over time; ArcFlow learns those crescendos (momentum modes) and plays them in rhythm, not as a single flat note.
- Sports analogy: A soccer ball's path bends with spin; ArcFlow models the spin (momentum factors) so the 2-step kick still curves like the teacher's multi-tap dribble.
Before vs. After:
- Before: Few-step students marched in straight lines across long time gaps, missing the teacher's changing directions. Quality and diversity suffered.
- After: ArcFlow uses a mixture of smooth momentum processes to predict how velocity bends over time within each big step, then uses an analytic (exact) solver to land precisely where the teacher would.
Why it works (intuition, no equations):
- The teacher's generation is a smooth path where velocity at nearby times is related. Momentum says, "velocity now comes from velocity a moment ago, scaled predictably."
- ArcFlow predicts multiple basic velocities, each with its own momentum factor and a probability gate. Like mixing gentle, medium, and strong curves, it composes complex motion.
- Because these curves are exponent-like in time, their integral has a closed-form formula. That means we can compute the jump across a time interval exactly, instead of stacking many small, approximate hops.
Building blocks (each as a sandwich):
🍞 Hook: You know how a weather map shows wind arrows everywhere? 🥬 The Concept (Velocity field):
- What it is: A map saying, "at this place and time, move in this direction and speed."
- How it works: (1) Look at your current state. (2) The model predicts a velocity vector. (3) Follow it to update the image.
- Why it matters: Wrong arrows send you off-course and blur details. 🍞 Anchor: At time t=0.7, the field might say "sharpen edges a bit and boost sky color," nudging pixels that way.
🍞 Hook: Imagine several toy cars with different kinds of push: gentle, steady, turbo. 🥬 The Concept (Mixture of momentum modes):
- What it is: Combine several basic velocities, each with its own momentum factor and a gate that says how much to use it.
- How it works: (1) Predict K base velocities. (2) Predict K momentum factors (how each changes over time). (3) Predict K gating probabilities that sum to 1. (4) Blend them to get the final velocity.
- Why it matters: One mode can't capture all twists; a mix can model fine textures and big shapes together. 🍞 Anchor: One mode cleans low-frequency shapes, another sharpens high-frequency textures; together they match the teacher.
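A minimal NumPy sketch of the blend, assuming an exponential time curve per mode (the paper's exact curve may differ). Here `v`, `gamma`, and `pi` stand in for the three predicted quantities:

```python
import numpy as np

def mixture_velocity(v, gamma, pi, elapsed):
    """Blend K momentum modes into one velocity.
    v: (K, D) base velocities, gamma: (K,) momentum factors,
    pi: (K,) gates summing to 1, all predicted once at the step start.
    The exponential time curve is an illustrative choice."""
    curves = np.exp(gamma * elapsed)                 # each mode's push over time
    return (pi[:, None] * curves[:, None] * v).sum(axis=0)

v = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]])   # K=3 modes, D=2
gamma = np.array([0.0, 1.0, -1.0])                   # steady, accelerating, decaying
pi = np.array([0.6, 0.3, 0.1])
at_start = mixture_velocity(v, gamma, pi, elapsed=0.0)   # just the gated blend
```

As time elapses, the accelerating and decaying modes reweight themselves, so the blended velocity bends instead of staying constant.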
🍞 Hook: If you know a car's speed changes smoothly, you can predict where it'll be later without checking every second. 🥬 The Concept (Analytic integration):
- What it is: An exact math jump from start time to end time using the momentum mixture.
- How it works: (1) Use the predicted mixture at the start. (2) Plug into a closed-form formula. (3) Compute the exact displacement across the interval. (4) Land at the new time in one shot.
- Why it matters: No step-by-step wobble; fewer steps, higher precision. 🍞 Anchor: ArcFlow reaches the same spot the teacher would after many tiny hops, using one exact leap per big step.
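Under the same illustrative exponential form used above, the interval integral has a closed form, so the whole jump is one formula. The small threshold guarding the near-constant mode is an implementation choice, not from the paper:

```python
import numpy as np

def analytic_jump(v, gamma, pi, dt):
    """Exact displacement over an interval of length dt for the mixture
    sum_k pi_k * v_k * exp(gamma_k * s), integrated over s in [0, dt].
    Closed form: sum_k pi_k * v_k * (exp(gamma_k * dt) - 1) / gamma_k,
    which tends to pi_k * v_k * dt as gamma_k -> 0."""
    safe = np.where(np.abs(gamma) > 1e-8, gamma, 1.0)
    factor = np.where(np.abs(gamma) > 1e-8, np.expm1(safe * dt) / safe, dt)
    return (pi[:, None] * factor[:, None] * v).sum(axis=0)

v = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = np.array([0.5, 0.0])
pi = np.array([0.7, 0.3])
jump = analytic_jump(v, gamma, pi, dt=0.5)   # one exact leap, no micro-steps
```

Because the formula is exact, it agrees with what thousands of tiny Euler steps would produce, at the cost of a single evaluation.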
🍞 Hook: Swapping big heavy shoes for light clip-ons makes running easier. 🥬 The Concept (Parameter-efficient adapters, LoRA):
- What it is: Small add-on pieces that gently steer a big model without changing its whole body.
- How it works: (1) Freeze most weights. (2) Add low-rank adapters and a new head to predict modes, momentum, and gates. (3) Train only these small parts.
- Why it matters: Faster, cheaper training that's more stable. 🍞 Anchor: ArcFlow trains <5% of parameters yet matches the teacher closely.
Put together, these parts let ArcFlow learn the teacher's curvy path and ride it smoothly in just two steps, keeping both fidelity and diversity high.
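The LoRA idea can be sketched with plain arrays; the names and shapes here are illustrative, not ArcFlow's actual adapter heads:

```python
import numpy as np

class LoRALinear:
    """A frozen weight W plus a trainable low-rank update B @ A.
    Only A and B (rank r much smaller than the layer size) would be
    trained; W stays fixed."""
    def __init__(self, W, rank=2, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                    # frozen backbone weight
        self.A = 0.01 * rng.standard_normal((rank, d_in))
        self.B = np.zeros((d_out, rank))              # zero init: no change at first

    def __call__(self, x):
        return self.W @ x + self.B @ (self.A @ x)

W = np.eye(8)                  # pretend backbone layer
layer = LoRALinear(W, rank=2)  # trainable params: 2*(8+8)=32 vs 64 frozen
x = np.ones(8)
same = np.allclose(layer(x), W @ x)   # zero-init adapter is a no-op at first
```

The zero-initialized `B` means training starts exactly at the frozen teacher's behavior and only gently steers away, which is part of why adapter training stays stable.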
03 Methodology
At a high level: Prompt + Noise → Predict momentum-mixture parameters at the start of a big interval → Use analytic integration to jump to the end of the interval → Repeat for the second interval → Final image.
Step-by-step recipe:
- Inputs and setup
- What happens: Feed the text prompt and the current noisy latent image into the (mostly frozen) backbone with small LoRA adapters and three tiny heads that predict: (a) basic velocities v_k, (b) momentum factors γ_k, and (c) gating probabilities π_k.
- Why it exists: We need all three to build a flexible, curved velocity over time.
- Example: For K=16 modes, the model outputs 16 v's, 16 γ's, and 16 π's that sum to 1.
- Build the velocity field as a mixture
- What happens: Combine the K modes: velocity = Σ π_k · v_k · time_curve(γ_k, t). Here, time_curve is a smooth function of time controlled by γ_k (think "how much the push changes as time passes").
- Why it exists: One straight push isn't enough; the mixture captures both slow bends and quick tweaks.
- Example: If π_1 is high with a gentle γ_1, the step will mostly follow a slow, broad curve; small π_7 with a sharp γ_7 adds fine crisping of texture.
- Analytic transition operator (the exact jump)
- What happens: Instead of marching with tiny steps, ArcFlow uses a closed-form formula that integrates the mixture across the time interval [t_start, t_end] in one computation.
- Why it exists: Tiny steps are slow and introduce approximation error; exact jumps keep us on the teacher's path with few NFEs.
- Example: From t=1.0 to t=0.5, compute a single displacement Δx via the analytic formula, then set x_0.5 = x_1.0 − Δx.
- Mixed trajectory integration during training (teacher-to-student handoff)
- What happens: For each big interval, early in training we let the teacher guide the first part of the jump, then let the student finish the rest. Over time, the student takes over more.
- Why it exists: Keeps the student on the teacher's manifold early, avoids drifting, and speeds up stable learning.
- Example: If the interval is [0.8 → 0.6], we may let the teacher guide [0.8 → 0.7] and the student finish [0.7 → 0.6]. As training progresses, the mix point slides so the student does more.
- Instantaneous velocity matching (the learning signal)
- What happens: At sampled times and latents along the interval, the student's current velocity (from the mixture) is matched to the teacher's velocity by minimizing their difference.
- Why it exists: Matching the tangent (direction and speed) forces the student's curve to align with the teacher's curve.
- Example: At t=0.72 with latent x, compute student velocity vs(x,t) and teacher velocity vt(x,t); train to make vs ≈ vt.
- Repeat for just two intervals at inference
- What happens: In generation, we do two big analytic jumps (2 NFEs). Start at pure noise (t≈1), predict mixture, jump to the middle (e.g., t≈0.5). Predict again, jump to t=0 to get the image.
- Why it exists: Two precise leaps mimic dozens of tiny hops.
- Example: From 50 steps down to 2, images stay sharp and well-aligned with prompts.
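The whole recipe at inference time is two predict-and-jump operations. Everything here is a hedged toy: `predict_mixture` stands in for the backbone plus heads, and the exponential mixture form is an assumed illustration, not the paper's exact one:

```python
import numpy as np

def analytic_jump(v, gamma, pi, dt):
    # Closed-form integral of sum_k pi_k * v_k * exp(gamma_k * s) over [0, dt].
    safe = np.where(np.abs(gamma) > 1e-8, gamma, 1.0)
    factor = np.where(np.abs(gamma) > 1e-8, np.expm1(safe * dt) / safe, dt)
    return (pi[:, None] * factor[:, None] * v).sum(axis=0)

def sample_two_steps(predict_mixture, shape=(4,), seed=0):
    """2-NFE sampling: one model call per interval, then one exact jump."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                   # t = 1: pure noise
    for t_start, t_end in [(1.0, 0.5), (0.5, 0.0)]:
        v, gamma, pi = predict_mixture(x, t_start)   # the only NFE per interval
        x = x - analytic_jump(v, gamma, pi, t_start - t_end)
    return x                                         # t = 0: final latent

# Toy predictor: a single mode pushing the latent toward zero.
def toy_predict(x, t):
    return x[None, :], np.zeros(1), np.ones(1)

out = sample_two_steps(toy_predict)   # each half-interval halves the latent
```

Note the loop body runs exactly twice: that is the entire inference cost.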
The secret sauce (each as a sandwich):
🍞 Hook: Like planning a hike by tracing a smooth path on the map instead of connecting peaks with straight lines. 🥬 The Concept (Non-linear flow within each step):
- What it is: Curvy motion encoded inside a single big step via momentum mixtures.
- How it works: (1) Predict v_k, γ_k, π_k once at the step's start. (2) Use the analytic formula to integrate to the end. (3) No re-evaluations needed inside the interval.
- Why it matters: Follows the teacher's bend without micro-steps. 🍞 Anchor: One evaluation per interval still captures the teacher's changing direction inside that interval.
🍞 Hook: If one violin can't play all parts of an orchestra, use a whole ensemble. 🥬 The Concept (Multiple momentum modes):
- What it is: Several time-behaviors blended so each handles a different frequency/detail.
- How it works: (1) Some modes decay slowly (broad shapes), some quickly (fine textures). (2) Gating picks the right mix per image.
- Why it matters: Richer motion → closer teacher match. 🍞 Anchor: Landscapes need smooth hills; fabrics need fine weave; different modes cover both.
🍞 Hook: Use a calculator's exact formula instead of rough mental math. 🥬 The Concept (Analytic solver):
- What it is: A closed-form computation that exactly integrates the mixture.
- How it works: (1) Plug in γ_k, v_k, π_k at the start. (2) Compute Δx with a stable formula that even handles γ→1 gracefully. (3) Update the latent.
- Why it matters: Precision without step-by-step errors. 🍞 Anchor: Fewer steps, cleaner landings, better images.
🍞 Hook: Clip-on training wheels for a bike, not a full frame rebuild. 🥬 The Concept (LoRA adapters for parameter efficiency):
- What it is: Train a tiny fraction of weights to steer a huge model.
- How it works: (1) Keep the backbone frozen. (2) Train low-rank adapters and an output head that predicts the momentum mixture. (3) Converge quickly.
- Why it matters: Saves compute and avoids destabilizing the teacher. 🍞 Anchor: ArcFlow fine-tunes <5% of parameters yet reaches teacher-like quality in 2 steps.
Concrete mini example:
- Suppose at the start of a step, π_1=0.6 (broad), π_2=0.3 (medium), π_3=0.1 (fine). γ_1≈1 (almost linear), γ_2>1 (accelerating), γ_3<1 (decelerating). The solver blends these to move the latent cleanly to the next time, keeping large shapes steady while crisping edges, just like the teacher would over many tiny moves.
04 Experiments & Results
The test: Can ArcFlow, in only 2 steps (2 NFEs), match or beat the image quality and prompt-following of big teacher models that usually need ~50 steps?
🍞 Hook: It's like finishing a jigsaw puzzle in two big moves instead of fifty tiny ones, and still getting every piece right. 🥬 The Concept (NFEs: Number of Function Evaluations):
- What it is: How many times the model is run per image.
- How it works: (1) Traditional: 40-100 runs. (2) ArcFlow: just 2 runs. (3) Fewer runs = faster images.
- Why it matters: Huge speedups unlock real-time uses. 🍞 Anchor: Going from 40-100 NFEs down to 2 gives about a 40× speed boost over the original teacher pipeline.
Benchmarks and baselines:
- Datasets: Geneval (object alignment), DPG-Bench (long, dense prompts), OneIG-Bench (varied aspects), Align5000 (distributional fidelity vs. teacher).
- Teachers/backbones: Qwen-Image-20B and FLUX.1-dev.
- Competitors (all at 2 steps): pi-Flow, TwinFlow, SenseFlow, Qwen-Image-Lightning.
Scores with context:
- FID/pFID (lower is better) compare distributions to the teacher's images. ArcFlow gets the lowest FID and pFID across both Qwen and FLUX students, meaning its pictures statistically look most like the teacher's 50-step results.
- CLIP/prompt alignment: ArcFlow keeps strong text-image matching, comparable to or better than other 2-step baselines.
- Diversity: Some adversarial baselines improved sharpness but lost variety (mode collapse). ArcFlow preserves diversity notably better (e.g., big gains on OneIG-Bench diversity), because it follows the teacher's path instead of forcing a linear shortcut.
- Convergence: ArcFlow reaches good FID faster and with fewer wobbles during training compared to pi-Flow and TwinFlow.
🍞 Hook: Grading tests isn't just about one score; you need multiple subjects. 🥬 The Concept (FID and pFID):
- What it is: Numbers that say how close your image distribution is to the teacher's distribution (overall and at patch level).
- How it works: (1) Extract features from images. (2) Compare statistics between student and teacher outputs. (3) Lower gap = better match.
- Why it matters: A high-quality student should look like the teacher's images, not just any sharp images. 🍞 Anchor: ArcFlow's pFID is especially low, signaling it keeps fine details and textures like the teacher.
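The Fréchet distance behind FID reduces to a formula over feature statistics. A NumPy-only sketch (real FID extracts Inception-network features first; here any feature matrix works, so this is illustrative):

```python
import numpy as np

def fid(feat_a, feat_b):
    """Frechet distance between two feature sets (rows = samples).
    FID = ||mu_a - mu_b||^2 + Tr(Ca + Cb - 2 * (Ca Cb)^(1/2)),
    with the cross term computed via the symmetric Cb^(1/2) Ca Cb^(1/2)."""
    mu_a, mu_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    Ca = np.cov(feat_a, rowvar=False)
    Cb = np.cov(feat_b, rowvar=False)
    # Symmetric PSD square root of Cb via eigendecomposition.
    w, V = np.linalg.eigh(Cb)
    Cb_half = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T
    cross = np.linalg.eigvalsh(Cb_half @ Ca @ Cb_half)
    tr_sqrt = np.sqrt(np.clip(cross, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(Ca) + np.trace(Cb) - 2.0 * tr_sqrt)
```

Identical distributions score 0; shifting every feature moves the score by the squared mean gap, which is why matching the teacher's statistics (not just making sharp images) drives the number down.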
Standout quantitative notes:
- Qwen student: ArcFlow achieves pFID ≈ 3.78 vs. higher numbers for other 2-step students; CLIP ≈ 0.325 (matching the teacher's alignment score) and notably higher diversity than adversarially trained baselines.
- FLUX student: ArcFlow similarly leads FID and pFID among 2-step peers, while staying prompt-faithful.
- Overall speed: ~40× speedup over a 50-step teacher pipeline; inference times close to fully fine-tuned baselines despite using small adapters.
Surprising findings:
- Linear-shortcut students sometimes score fine on prompt matching but drift from the teacher's distribution (worse FID/pFID), producing blurrier textures or artifacts. ArcFlow avoids this trade-off by aligning the velocity curves, not just end results.
- Training stability: Even with very few trainable parameters, ArcFlow converges faster than methods that update the whole network. Respecting the geometry (curves) seems to make learning easier, not harder.
🍞 Hook: If two runners finish with the same time but one ran the official course and the other cut corners, which is truly better? 🥬 The Concept (Trajectory fidelity):
- What it is: Staying on the teacher's path, not just ending at a similar final image.
- How it works: (1) Match instantaneous velocities along the way. (2) Use analytic integration to avoid drift. (3) Preserve teacher priors.
- Why it matters: True alignment yields both prompt correctness and realistic, detailed textures. 🍞 Anchor: ArcFlow's lower FID/pFID show it runs the teacher's course, not a shortcut that misses key turns.
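The training signal that enforces this, matching instantaneous velocities, is just a mean squared gap between the two velocity fields at sampled points. `student_v` and `teacher_v` are hypothetical callables standing in for the two networks:

```python
import numpy as np

def velocity_matching_loss(student_v, teacher_v, x, times):
    """Mean squared gap between student and teacher velocities at sampled
    (latent, time) points along the interval."""
    gaps = [student_v(x, t) - teacher_v(x, t) for t in times]
    return float(np.mean([np.mean(g ** 2) for g in gaps]))

x = np.ones(4)
times = [0.72, 0.65, 0.60]
teacher = lambda x, t: t * x              # toy stand-in for the teacher field
loss_perfect = velocity_matching_loss(teacher, teacher, x, times)
loss_offset = velocity_matching_loss(lambda x, t: t * x + 1.0, teacher, x, times)
```

A student that matches the teacher's tangents everywhere scores zero; any constant drift shows up directly in the loss, which is how curve alignment (not just endpoint matching) gets rewarded.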
05 Discussion & Limitations
Limitations:
- 1-step (1 NFE) generation degrades: With only one giant leap, predicting momentum precisely enough is hard, leading to blur. Two steps are the practical minimum for now.
- Mode count trade-offs: More momentum modes (K) can help, but returns diminish and add small overhead.
- Assumes access to a strong teacher: Without one, you can't distill these high-quality curves.
Required resources:
- A pre-trained diffusion/flow teacher (e.g., Qwen-Image-20B, FLUX.1-dev).
- Compute for distillation; while parameter-efficient, training still benefits from multi-GPU setups for large-scale prompts.
- Datasets of prompts and evaluation suites (Geneval, DPG, OneIG, Align5000).
When not to use:
- Ultra-tiny devices where even 2 NFEs with a large backbone are too heavy.
- Scenarios demanding strictly 1-step generation with top quality; ArcFlow isn't there yet.
- If you need to depart from the teacher's style on purpose (e.g., heavy style transfer), pure trajectory matching may be less ideal.
Open questions:
- Can we design richer momentum predictors that stay stable at 1 NFE?
- How well does this scale to even larger backbones or to video, 3D, or multimodal tasks with temporal/spatial couplings?
- Can we auto-tune K (number of modes) or dynamically allocate modes per prompt?
- Could hybrid training mix small doses of adversarial or perceptual objectives without losing trajectory fidelity?
- Are there theoretical guarantees for generalization beyond observed timesteps and prompts?
🍞 Hook: If you can clean a room fast and neatly, not just fast, you'll use it more often. 🥬 The Concept (Practical balance: speed, quality, diversity):
- What it is: Hitting the sweet spot: fast renders, crisp details, varied outputs.
- How it works: (1) Two exact leaps via analytic integration. (2) Curved trajectories via momentum mixtures. (3) Light adapters for training speed.
- Why it matters: Real-world users care about all three, not just one. 🍞 Anchor: ArcFlow images feel like the teacher's but arrive much faster, making creative tools more responsive.
06 Conclusion & Future Work
In three sentences: ArcFlow makes 2-step text-to-image generation possible without sacrificing quality by modeling the teacher's curved trajectory using a mixture of momentum modes and then integrating it exactly. This high-precision match keeps images sharp and diverse while slashing compute, thanks to parameter-efficient LoRA adapters and a stable training recipe. It outperforms prior few-step methods on fidelity metrics and convergence speed across strong backbones like Qwen-Image-20B and FLUX.1-dev.
Main achievement: Showing that explicitly learning non-linear (curved) flow with an analytic solver beats linear shortcuts for few-step generation, delivering near-teacher quality at ~40× speedup.
Future directions:
- Push quality at 1 NFE with stronger momentum modeling.
- Extend to video, 3D, and other modalities where trajectory curvature may be even richer.
- Auto-select the number of modes per prompt for efficiency.
- Blend with gentle perceptual cues without harming trajectory fidelity.
Why remember this: ArcFlow's core idea, respect the curve and solve it exactly, turns the few-step problem from a fight against geometry into a ride along it. That shift unlocks fast, faithful image generation that's practical for real-world creative and interactive systems.
Practical Applications
- • Real-time concept art and storyboarding where creators preview many styles in seconds.
- • Interactive product design mockups (packaging, logos, materials) with high texture fidelity.
- • Live AR filters and overlays that must render quickly yet look detailed and coherent.
- • Rapid mood boards and layout exploration for marketing and UI/UX teams.
- • On-device or edge generation with tight latency budgets (e.g., kiosks, exhibits).
- • Game prototyping: swiftly generate environment variations consistent with an art bible.
- • Education demos that visualize physics, history, or biology scenes on the fly.
- • E-commerce: instant lifestyle renders of products with accurate textures and lighting.
- • Data augmentation for vision tasks where distributional fidelity to a trusted teacher matters.
- • Pre-visualization for film/animation, maintaining stylistic consistency at low cost.