
Spanning the Visual Analogy Space with a Weight Basis of LoRAs

Intermediate
Hila Manor, Rinon Gal, Haggai Maron et al. · 2/17/2026
arXiv

Key Summary

  • This paper teaches image models to copy a change shown in one image pair and apply it to a new image, like saying 'hat added here, add a similar hat there.'
  • Instead of using one tiny add-on (a single LoRA) to handle every kind of change, the authors train a whole toolbox (a basis) of LoRAs.
  • A small encoder looks at the three input images and decides how much to use from each LoRA in the toolbox, mixing them on the fly.
  • The mixed LoRA is plugged into a powerful image model (Flux.1-Kontext) that sees all three images directly through extended attention to keep details sharp.
  • This dynamic mix-and-match approach generalizes better to new, unseen transformation types than previous single-LoRA methods.
  • The method improves both how well the edit matches the example (edit accuracy) and how well the original image content is preserved (preservation).
  • User studies and VLM-based evaluations show the new method is preferred over several strong baselines across many tasks.
  • Ablations reveal that having a larger basis (more LoRAs) helps generalization, while too-high rank or allowing negative mixing can hurt.
  • The approach is efficient to train and use, and hints that 'spaces of LoRAs' are a flexible way to control visual edits.
  • This could make visual editing tools more reliable for styles, object insertions, poses, and more—with less fiddly prompting.

Why This Research Matters

This work makes image editing by example both easier and more reliable. Instead of guessing the right words to describe a tricky style or object change, users can just show the change and let the system copy it accurately. That means fewer failed edits, less trial-and-error prompting, and more faithful results that preserve the original subject. Designers can transfer styles or insert objects consistently across many images. Educators and creators can adapt looks and layouts quickly without hand-tuning. Overall, it brings us closer to intuitive, example-driven creative tools that work the way people naturally communicate: by demonstration.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) Imagine you show a friend two pictures: in the first, a person has no hat; in the second, the same person now wears a red beanie. Then you point to a new photo of a different person and say, “Do the same change!” Your friend understands: add a similar red beanie to the new person.

🥬 Filling (The Actual Concept)

  • What it is: Visual analogy learning is teaching a computer to copy a change it sees between two images (A to A') and apply the same kind of change to a new image (B to B').
  • How it works (before this paper): Many systems relied on text editing: you write a prompt like “add a hat.” But text can be vague, and styles, poses, or special looks are hard to describe exactly. So newer methods show the example pair (A, A') plus B, and the model tries to produce B'.
  • Why it matters: People can demonstrate complex changes—specific styles, exact accessory shapes, or subtle makeup—more precisely than describing them with words.

🍞 Bottom Bread (Anchor) If A shows a dog, and A' shows the same dog turned into a watercolor painting, then B' should turn a new cat (B) into a similar watercolor style.

🍞 Top Bread (Hook) You know how a big backpack can carry lots of tools, but sometimes you only need a few special ones for a job?

🥬 Filling (LoRA, the key ingredient)

  • What it is: A LoRA (Low-Rank Adapter) is a tiny add-on to a big model that gently nudges it to act differently without retraining everything.
  • How it works: (1) Keep the big model’s original weights frozen. (2) Learn two small matrices whose product is a tiny “nudge” (low-rank update). (3) Add this nudge during inference to adapt behavior. (4) Because the update is low-rank, it’s fast and memory-friendly.
  • Why it matters: One LoRA can teach a model a new style or trick cheaply. But one LoRA trying to cover all possible transformations is like one multi-tool doing every job—handy but limited.

🍞 Bottom Bread (Anchor) A single LoRA can make a photo look like a pencil sketch. But making every kind of change—pencil, watercolor, sci-fi armor, pose swap—strains a single LoRA.
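The low-rank "nudge" described above can be sketched in a few lines of NumPy. The layer sizes and rank here are toy values for illustration, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen base weight of a hypothetical linear layer (d_out x d_in).
d_out, d_in, rank = 8, 8, 2
W = rng.standard_normal((d_out, d_in))

# LoRA: two small matrices whose product is a low-rank update.
# B starts at zero, so the adapter initially changes nothing.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def forward(x, scale=1.0):
    """Frozen weights plus the low-rank nudge B @ A."""
    return x @ (W + scale * (B @ A)).T

x = rng.standard_normal((1, d_in))
# With B = 0 the adapted layer matches the frozen layer exactly.
assert np.allclose(forward(x), x @ W.T)

# The adapter trains only rank * (d_in + d_out) parameters,
# far fewer than the d_out * d_in of a full fine-tune.
print(rank * (d_in + d_out), "vs", d_out * d_in)  # 32 vs 64
```

Only `A` and `B` are updated during training; the base weight `W` stays frozen, which is what makes LoRA cheap in memory and compute.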

🍞 Top Bread (Hook) Think of sharpening a fuzzy photo by turning one knob at a time until it looks clear.

🥬 Filling (Diffusion/flow models)

  • What it is: Diffusion or flow-based models start from noisy or in-between states and learn a path back to a clean, realistic image.
  • How it works: (1) Start with a latent that blends noise and real data. (2) A network learns a “velocity field” that points from noisy to clean. (3) Step-by-step, the model follows this field to reach a sharp image. (4) Conditioning (like text or images) guides the path.
  • Why it matters: These models are great at generating and editing images while keeping them realistic.

🍞 Bottom Bread (Anchor) Flux.1-Kontext is a flow-based image model used here for editing guided by examples.
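A toy 1-D version of the "follow the velocity field" idea, under the simplifying assumption of a straight-line (rectified-flow) path where the ideal velocity is constant; real models learn this field with a neural network:

```python
import numpy as np

# Toy 1-D flow: x0 is the noise sample, x1 the clean target.
# Along a straight-line path x_t = (1 - t) * x0 + t * x1,
# the ideal velocity is simply the constant x1 - x0.
x0, x1 = 4.0, -1.0
velocity = lambda x, t: x1 - x0  # a trained network would predict this

# Euler integration from t=0 (noise) to t=1 (clean sample).
x, steps = x0, 10
for i in range(steps):
    x += velocity(x, i / steps) * (1.0 / steps)

# Stepping along the field lands exactly on the clean sample.
assert abs(x - x1) < 1e-9
```

Conditioning (text or reference images) enters through the network that predicts the velocity, bending the path toward the desired output.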

🍞 Top Bread (Hook) When you look at a collage, your eyes focus on the important parts without losing track of the whole picture.

🥬 Filling (Extended attention)

  • What it is: Extended attention lets the model look at multiple images at once and connect the right details across them.
  • How it works: (1) Feed the model a 2×2 grid with A, A', B, and a placeholder for B'. (2) The model’s attention lines up matching regions and details. (3) It uses this context to guide the edit for B'.
  • Why it matters: This preserves fine details of B (like identity, layout) while transferring the right kind of change from A→A'.

🍞 Bottom Bread (Anchor) If A' adds a crystal crown, extended attention helps the model place a matching crown on B’s head, not on the background.
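A minimal sketch of attending across all four images at once: random placeholder tokens stand in for real image features, and the shared query/key/value projections are omitted for brevity:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
d, tokens_per_image = 16, 4

# Hypothetical token sequences for A, A', B, and the B' placeholder.
A, A_p, B, B_p = (rng.standard_normal((tokens_per_image, d)) for _ in range(4))

# Extended attention: B' queries attend over ALL four images' tokens,
# so the model can line up what changed in A -> A' with regions of B.
context = np.concatenate([A, A_p, B, B_p], axis=0)   # (16, d)
Q, K, V = B_p, context, context
attn_out = softmax(Q @ K.T / np.sqrt(d)) @ V

assert attn_out.shape == B_p.shape
```

The key point is the concatenated context: instead of each image attending only to itself, the B' tokens can directly read details from A, A', and B.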

🍞 Top Bread (Hook) Imagine a paint set with many small colors you can mix to get just the shade you want.

🥬 Filling (What was missing)

  • The problem: Prior methods used a single LoRA to handle every kind of change, which limited generalization to unseen transformations.
  • The gap: We need a flexible set of small transformation pieces that can be mixed per task.
  • The idea: Learn a toolbox (basis) of LoRAs and a tiny chooser (encoder) that picks and mixes the right ones for each analogy.

🍞 Bottom Bread (Anchor) Instead of one “universal” LoRA, this paper builds a shelf of 32 small LoRAs and learns to blend them differently for every edit request.

02Core Idea

🍞 Top Bread (Hook) You know how a DJ mixes several tracks to create the perfect vibe for each party?

🥬 Filling (Aha! in one sentence) The key insight is to learn a basis (toolbox) of LoRAs and dynamically mix them—based on the input example pair—into a single tailored adapter for each analogy task.

  • Multiple analogies for the same idea:
  1. Smoothie bar: pick different fruits (LoRAs) depending on the customer’s taste (A, A', B), and blend them to make the perfect smoothie (mixed LoRA).
  2. Lego kit: assemble different bricks (LoRAs) into the structure needed for today’s build (this analogy), not one pre-built toy for all cases.
  3. Sports team: choose players with the right skills (LoRAs) for this match (current analogy), instead of always playing the same lineup.
  • Before vs After: Before: One adapter tried to handle all edits, often failing on new styles or complex changes. After: A small encoder looks at the analogy triplet, picks weights for each LoRA in the basis, and creates a custom adapter on the spot, improving generalization and edit accuracy.

  • Why it works (intuition):

  1. Many visual transformations are combinations of simpler primitives (style shift, color change, texture add, local insertions).
  2. A basis of LoRAs can learn these primitives across layers of the model.
  3. A learned router (the encoder) selects a mixture that matches the requested change, guided by the example pair.
  4. Because the basis is shared and trained jointly, it becomes smooth and interpolatable: new transformations land between known ones.
  • Building Blocks (with sandwich explanations): 🍞 Hook: Imagine having a set of musical notes you can combine into any melody. 🥬 Concept: Learnable weight basis of LoRAs is a set of small adapters trained together so their mixes express many transformations. How: (1) Keep N small LoRAs per targeted weight/layer. (2) Train them jointly so combinations cover diverse edits. (3) Encourage reuse across tasks. Why: A single LoRA can’t cover the whole space; a basis can span it. 🍞 Anchor: Mixing a “texture LoRA” with a “color LoRA” can yield “glossy watercolor.”

🍞 Hook: Choosing the right crayons before you draw. 🥬 Concept: Dynamic composition of LoRAs means computing a weighted mix of the basis per analogy. How: (1) Encode A, A', B with a ViT (like CLIP). (2) Project to a query vector. (3) Compare against learned keys, get softmax weights. (4) Linearly combine LoRAs using those weights. Why: Different analogies need different blends; fixed weights can’t adapt. 🍞 Anchor: For “add a crystal crown,” weights favor LoRAs that capture local object insertion and sparkly textures.

Result: The mixed LoRA plugs into Flux.1-Kontext, which sees the full 2×2 image grid via extended attention, producing B' that matches both the example transformation and B’s details.

03Methodology

Overview: Input {A, A', B} + prompt → [Step A: Encode and query] → [Step B: Select mix weights] → [Step C: Combine LoRAs per layer] → [Step D: Build 2×2 context] → [Step E: Guided generation with extended attention] → Output B'.

Step A: Encode and query

  • What happens: A frozen vision transformer (CLIP ViT) encodes each image A, A', and B separately. Their embeddings are concatenated and passed through a small learned projection, producing a query vector q.
  • Why this step exists: Separate encodings preserve which image is which (A vs A' vs B) and avoid losing details from shrinking a whole 2×2 grid to CLIP’s fixed size. Without it, the router might misunderstand the analogy or miss fine cues.
  • Example: If A→A' adds a blue crystal crown, q captures “local head addition + crystalline + blue hue,” plus B’s head context.

Step B: Select mix weights with learned keys

  • What happens: Each LoRA in the basis has a learned key vector. The model compares q to all keys, applies softmax, and gets positive weights that sum to 1.
  • Why this step exists: Softmax keeps mixtures stable and prevents extreme or conflicting combinations. Allowing negative or unconstrained weights can push the model off-distribution.
  • Example: For “steampunk portrait,” higher weights go to LoRAs that encode metallic textures, brown palettes, and face-localized edits.

Step C: Combine LoRAs per targeted layers

  • What happens: For each targeted weight matrix in Flux.1-Kontext, the method linearly combines the N basis LoRAs using the selected weights to form one “mixed LoRA” update for that layer.
  • Why this step exists: Different layers represent different semantics; composing per layer captures richer transformations. Without per-layer mixing, edits could be too crude or mislocalized.
  • Example: Early layers might handle layouts (where the crown sits), while later layers add sparkle and blue tint.
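Steps A–C can be sketched together in NumPy. The basis size N=32 and rank 4 match the configuration reported later in this article; the feature dimensions and the query vector are illustrative placeholders for the real CLIP-derived query:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
N, rank, d_out, d_in, d_q = 32, 4, 8, 8, 16

# Basis of N LoRAs for one targeted layer, plus a learned key per LoRA.
basis_A = rng.standard_normal((N, rank, d_in)) * 0.01
basis_B = rng.standard_normal((N, d_out, rank)) * 0.01
keys = rng.standard_normal((N, d_q))

# Step A (stand-in): query vector from the encoded (A, A', B) triplet.
q = rng.standard_normal(d_q)

# Step B: compare q to all keys; softmax gives positive weights summing to 1.
w = softmax(keys @ q)
assert abs(w.sum() - 1.0) < 1e-9 and (w >= 0).all()

# Step C: linearly combine the basis into one mixed LoRA update for this layer.
delta_W = sum(w[i] * (basis_B[i] @ basis_A[i]) for i in range(N))
assert delta_W.shape == (d_out, d_in)
```

In the full method this mixing is repeated per targeted layer, so each layer gets its own blend of the shared basis.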

Step D: Build a 2×2 context image

  • What happens: Construct a grid [A, A'; B, B] and feed it to Flux.1-Kontext with the mixed LoRA and a guiding prompt.
  • Why this step exists: Extended attention across the grid aligns details from A→A' to B, while preserving B’s identity and structure. Without the grid, the model loses direct visual guidance.
  • Example: The model sees exactly what changed between A and A' (a crown on the head) and where to perform a similar change on B (also the head).

Step E: Guided generation (flow matching)

  • What happens: The flow model iteratively refines the bottom-right quadrant to become B'. The prompt helps resolve ambiguity (e.g., “crystal crown”), while the example pair nails the exact look.
  • Why this step exists: The flow model ensures realistic, coherent images. Without it, edits could be blurry or inconsistent.
  • Example: The final B' shows the subject with a matching crystal crown, correct placement, and preserved facial identity.

Concrete data example

  • Inputs: A = panda without crown; A' = same panda with a blue crystal crown; B = fox without crown; prompt = “Give this creature a crown of crystals.”
  • Router encodes (A, A', B) → q; softmax over keys → weights; mixed LoRA per layer built.
  • Flux.1-Kontext gets the 2×2 grid and generates B' = fox with a blue crystal crown, placed on the head, crystal style matching A'.

What breaks without each step

  • No separate encodings: The router may confuse which image is A vs B, reducing edit accuracy.
  • No softmax: Negative or huge weights can destabilize results and drift off-distribution.
  • Single-layer mixing: Misses fine-grained control; edits may be misplaced or too global.
  • No 2Ă—2 context: Loses precise visual analogy; results become prompt-only and less faithful.
  • No flow backbone: Lower realism and weaker preservation.

Secret sauce

  • A learnable basis (e.g., N=32, rank r=4) trained jointly across layers creates a smooth, interpolatable space of edit primitives.
  • A tiny, frozen-backbone encoder (CLIP) plus a small projector keeps routing lightweight and stable.
  • Extended attention lets the generator read all three images directly, preserving details while transferring the right change.
  • Decoupling “what edit to do” (router + basis) from “how to render it” (Flux.1-Kontext) improves both flexibility and quality.

04Experiments & Results

The test

  • What they measured: Two things matter most: (1) Edit Accuracy—does B' apply the same transformation seen in A→A'? (2) Preservation—does B' keep B’s original identity, layout, and details that shouldn’t change?
  • Why: Great edits both match the requested change and keep the rest intact.

The competition

  • Baselines include: (a) A single-LoRA Flux.1-Kontext model with similar parameter budget, (b) RelationAdapter, (c) VisualCloze, and (d) EditTransfer.

Datasets and setup

  • Training on Relation252k (16k analogy pairs across 208 tasks) and evaluation on a combined suite of 840 analogy triplets spanning 100 tasks, including many unseen transformations (both in-domain and out-of-domain via community LoRAs).

Scoreboard with context

  • VLM-based metrics (using Gemma-3) rate both Edit Accuracy and Preservation. LoRWeB pushes the Pareto front—achieving high accuracy while also preserving B better than baselines. Think of it as scoring an A when others hover around B-range, and doing so on both “match the change” and “keep the original” at once.
  • Standard metrics like LPIPS (lower is better for preservation) and CLIP directional similarity (higher is better for matching the change) also favor LoRWeB’s balance.
  • Pairwise preferences: In 2-alternative-forced-choice tests, both a VLM and human users preferred LoRWeB’s results over baselines in most head-to-head comparisons (e.g., around 70% vs EditTransfer and around 68% vs VisualCloze), indicating consistent perceived quality.
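For reference, CLIP directional similarity is commonly computed as the cosine between the two edit directions in CLIP embedding space; the embeddings below are random placeholders, not real CLIP features:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(3)
# Hypothetical CLIP image embeddings for A, A', B, B'.
e_A, e_Ap, e_B, e_Bp = (rng.standard_normal(512) for _ in range(4))

# Does the edit direction for B -> B' match the direction
# demonstrated by A -> A'? Higher cosine = better match.
score = cosine(e_Ap - e_A, e_Bp - e_B)
assert -1.0 <= score <= 1.0
```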

Surprising findings

  • Bigger rank isn’t always better: Raising rank (r) too high can overfit and hurt editability, showing that a moderate rank with a larger basis (more LoRAs) generalizes best.
  • Negative mixing hurts: Using tanh (allowing negative weights) underperformed; softmax’s positive, normalized weights led to more stable and realistic edits.
  • Input layout matters: Encoding A, A', and B separately (not as a squished 2×2 into CLIP) improved the router’s understanding of the analogy and boosted accuracy.
  • Encoder choice: Swapping CLIP for SigLIP2 yielded similar trends, suggesting the method is robust to reasonable encoder backbones.

Overall, LoRWeB consistently delivered edits that better matched the example transformation while preserving subject identity and structure, especially on unseen tasks, exactly where single-LoRA methods typically struggle.

05Discussion & Limitations

Limitations

  • Very out-of-distribution edits (far from the training corpus and the learned basis) can still fail or look weak.
  • If the example pair (A, A') is ambiguous or low quality, the router may choose poor mixtures.
  • Some extremely precise geometric edits (e.g., complicated pose re-rigs) may need stronger structural priors.

Required resources

  • A strong image backbone (Flux.1-Kontext or similar) and a vision encoder (e.g., CLIP ViT) are needed.
  • Training reported with a single H100 GPU for about 10k steps; inference cost is similar to standard LoRA-edited diffusion/flow models, with minor overhead for routing and mixing.

When not to use

  • Purely text-driven edits where no example is available and the task is simple (a single text prompt may suffice).
  • Domains with zero coverage in training and no analogous examples to guide the router.

Open questions

  • Can we safely enable signed (positive/negative) mixing with better normalization, unlocking a richer basis while staying stable?
  • What is the best encoder for routing—could multi-scale or dense features improve fine-grained analogy matching?
  • How large should N (basis size) be before diminishing returns, and how should ranks vary across layers?
  • Can this basis idea carry over to video, 3D, or multi-modal edits where temporal or spatial consistency is critical?
  • How can we auto-curate example pairs to maximize clarity and reduce ambiguity for the router?

06Conclusion & Future Work

Three-sentence summary

This paper introduces LoRWeB, which learns a basis (toolbox) of LoRAs and dynamically mixes them per task to complete visual analogies. A small encoder examines the example pair and the target image, selects mixture weights, and the resulting mixed LoRA plugs into a powerful generator that sees all images via extended attention. This yields higher edit accuracy and better preservation on both seen and unseen tasks than single-LoRA approaches.

Main achievement

They show that a learned, jointly trained LoRA basis can span a useful “space of visual edits,” and that dynamic mixing at inference time significantly improves generalization and quality for analogy-based image editing.

Future directions

Extend the approach to video and 3D, explore richer routing encoders and safer signed mixing, and study how to scale basis size and rank per layer. Investigate automatic example-pair selection and active learning to broaden the covered transformation space.

Why remember this

Instead of forcing one adapter to do everything, build a toolbox and learn how to pick and mix the right tools for each job. This simple idea—spanning a space of LoRAs and composing them on demand—can make visual editing far more flexible, accurate, and robust to new challenges.

Practical Applications

  • Consistent style transfer across product catalogs (e.g., convert all items to watercolor or pencil style).
  • Object insertion or modification guided by examples (e.g., add matching hats or accessories to portraits).
  • Pose or layout mirroring (e.g., copy a posture change from one athlete photo to another).
  • Makeup, costume, and fashion look replication from a reference pair to new photos.
  • Background replacement that matches an example transformation (e.g., swap to a studio backdrop).
  • Art direction for films and games by demonstrating look changes instead of crafting long prompts.
  • Branding consistency: apply a brand’s treatment or texture across varied imagery via a single example pair.
  • Educational tools showing step-by-step visual transformations that can be reapplied to new images.
  • Photo restoration or enhancement styles learned from before/after exemplars.
  • Rapid A/B testing of visual edits by swapping example pairs to steer subtle changes.
#visual analogy learning#LoRA#LoRA basis#dynamic composition#extended attention#flow matching#Flux.1-Kontext#CLIP encoder#image editing by example#LPIPS#CLIP directional similarity#VLM evaluation#mixture of adapters#low-rank adaptation