
VecGlypher: Unified Vector Glyph Generation with Language Models

Intermediate
Xiaoke Huang, Bhavul Gauri, Kam Woh Ng et al. · 2/25/2026
arXiv

Key Summary

  • VecGlypher is a single language-model-based system that writes SVG code to draw crisp, editable letters (glyphs) directly from text descriptions or a few example images.
  • It treats font design like writing a recipe: given words like 'rounded, playful, humanist' or some reference glyph images and a target letter, it types the exact SVG path commands that define the outline.
  • A typography-aware data pipeline (cleaning, normalizing, deduplicating, and rounding coordinates) keeps the SVG sequences stable and easy for the model to learn.
  • Training happens in two stages: first, the model learns to 'draw in SVG' from 39K noisy Envato fonts; second, it learns to follow style instructions and copy styles from images using 2.5K expertly tagged Google Fonts.
  • The model outputs watertight, one-pass vector paths—so there’s no raster image step and no messy vectorization afterward.
  • On tough tests with never-seen font families, VecGlypher beats both specialized vector-font methods (like DeepVecFont-v2 and DualVector) and general LLMs by large margins.
  • Using absolute coordinates for SVG commands works best at larger model sizes, improving geometry and style consistency.
  • The approach lowers the barrier to font creation: you can design with words or bootstrap from a few examples, then get fully editable curves instantly.
  • Results suggest that, with the right data and recipe, language models can become a unified engine for professional typography.
  • This sets a foundation for future multimodal design tools that are more accessible, flexible, and scalable.

Why This Research Matters

Readable, beautiful text is everywhere—from homework sheets and school apps to bus signs, games, and brand logos. VecGlypher makes high-quality font creation as simple as describing a style in words or showing a few examples, which means more people can create accessible, fun, and culturally expressive type. Clean, one-pass vectors also save time and prevent errors for professionals who need precise, editable outlines. Teachers and students can prototype classroom-friendly alphabets and learning tools quickly. Product teams can iterate on UI fonts and icons faster without sacrificing clarity. As this approach expands to more writing systems, it can help preserve and celebrate diverse scripts in digital spaces.

Detailed Explanation

01Background & Problem Definition

🍞 Top Bread (Hook): You know how coloring books have outlines you can fill in, and those outlines stay crisp no matter how much you zoom in? That’s like vector drawings for letters.

🥬 Filling (The Actual Concept):

  • What it is: SVG is a way to draw pictures using text instructions like “move here, draw a line there, make a curve,” so the computer can render perfect, zoom-friendly shapes.
  • How it works:
    1. You write commands such as M (MoveTo), L (LineTo), Q (quadratic Bézier curve), and Z (ClosePath).
    2. Each command has numbers (coordinates) that say where to go.
    3. The computer follows these instructions to draw smooth outlines.
  • Why it matters: Without SVG (or similar vector code), letter shapes would be blurry when scaled and hard to edit precisely.

🍞 Bottom Bread (Anchor): If you want a giant poster with a letter “G,” SVG makes it look sharp on a billboard and tiny on a watch screen—always clean.
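
The commands above can be played with in a few lines of Python. This is a toy illustration, not the paper's code; the helper name `svg_path` is made up for this sketch.

```python
def svg_path(commands):
    """Join command tuples like ('M', 10, 90) into an SVG 'd' string."""
    parts = []
    for cmd, *coords in commands:
        # A command with no numbers (like 'Z') stands alone.
        parts.append(cmd if not coords else cmd + " " + " ".join(str(c) for c in coords))
    return " ".join(parts)

# MoveTo, LineTo, a quadratic curve (control point + endpoint), then close.
d = svg_path([("M", 10, 90), ("L", 50, 10), ("Q", 70, 50, 90, 90), ("Z",)])
print(d)  # M 10 90 L 50 10 Q 70 50 90 90 Z

# Wrap it in an <svg> element and any browser or vector editor can render it.
svg = f'<svg viewBox="0 0 100 100"><path d="{d}"/></svg>'
```

Paste the resulting `svg` string into an `.svg` file and it draws the same crisp shape at any zoom level.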

🍞 Top Bread (Hook): Imagine building a LEGO letter. You don’t just have blocks—you need them to fit the right curves and corners. Letters have structure.

🥬 Filling (The Actual Concept):

  • What it is: Glyph geometric structures are the shapes and rules (curves, corners, thicknesses, and holes) that make letters readable and consistent.
  • How it works:
    1. Define the main outline and any inner holes (like the counter inside “a” or “o”).
    2. Keep thickness and curves consistent across letters in the same style.
    3. Ensure paths close properly so the shapes fill correctly.
  • Why it matters: Without careful structure, letters look broken, uneven, or unreadable.

🍞 Bottom Bread (Anchor): Think of the lowercase “e”—if its inner opening collapses or its curve kinks, the letter becomes hard to read at small sizes.

🍞 Top Bread (Hook): When you tell a friend, “Make it bold and playful,” they picture a vibe and apply it to everything they design.

🥬 Filling (The Actual Concept):

  • What it is: Design intent is the style or feeling (like “rounded,” “vintage,” “sincere”) you want the letters to express.
  • How it works:
    1. Choose style words that describe thickness, curves, endings (serifs), and spacing.
    2. Apply those choices consistently to all letters.
    3. Adjust details (like terminals or contrast) to match the intended mood.
  • Why it matters: Without clear intent, a font family feels mismatched, like wearing a tuxedo jacket with gym shorts.

🍞 Bottom Bread (Anchor): “Playful, rounded, friendly” might lead to soft corners and wider, bouncier shapes—great for a kids’ book title.

🍞 Top Bread (Hook): Think of the rules of a sport—soccer needs a round ball, two goals, and a field; typography has rules too.

🥬 Filling (The Actual Concept):

  • What it is: Typographic constraints are the rules letters must follow: clean closures, proper counters, consistent strokes, and alignment.
  • How it works:
    1. Keep path topology correct (closed shapes, no gaps).
    2. Maintain consistent features across the alphabet (e.g., serif style, stroke contrast).
    3. Align to shared baselines and sizes so words look even.
  • Why it matters: Break these rules and letters become hard to read, impossible to fill, or stylistically chaotic.

🍞 Bottom Bread (Anchor): If the lowercase “n” leans one way and “m” leans another, a word like “anna” looks wobbly and amateurish.

🍞 Top Bread (Hook): You know how you can tell a smart speaker, “Play relaxing music,” and it understands your words? Computers can learn language.

🥬 Filling (The Actual Concept):

  • What it is: Natural language processing (NLP) lets computers understand and use human language to follow instructions.
  • How it works:
    1. Break sentences into tokens (words/pieces).
    2. Learn patterns that connect words to meanings.
    3. Use those meanings to act—like producing matching designs.
  • Why it matters: If we can describe fonts in words, then NLP can help turn descriptions into actual letter shapes.

🍞 Bottom Bread (Anchor): Saying “humanist sans-serif, calm, rounded” can guide an AI to shape friendly, smooth letters without any drawing from you.

The world before: Most automated font tools needed several example images of letters (exemplars) and then converted images to vectors afterward. That meant you had to already have a mini font sheet and endure a fussy raster-to-vector step, which often created messy curves and errors. It also locked out non-experts who’d rather describe a vibe than prepare reference images.

The problem: Designers want to say “make it narrow, high-contrast, vintage” and get clean, editable glyphs—without making their own example sheets. But regular language models that can draw simple icons in SVG stumble on fonts: letters need long sequences of precise coordinates, watertight paths, and style consistency across many characters.

Failed attempts: General LLMs prompted to “write an SVG path for a serif V” often output broken paths, wrong characters, or unmatched cases. Specialized pipelines relied on images-and-postprocessing, which created vectorization artifacts and still required exemplar sheets.

The gap: We needed one model that could understand style words or a few example images and then write correct SVG paths directly—no raster detours—and keep geometry and style tight and consistent.

Real stakes: This matters for everyone who reads, types, brands, or designs. Clean vectors mean crisp screens and print, fast iteration for logos and UI, and lower barriers for students, teachers, indie creators, and pros. If you can type your idea and get true vectors, you can explore styles in minutes instead of days.

02Core Idea

🍞 Top Bread (Hook): Imagine telling a robot artist, “Draw me a friendly, rounded G like this,” and it types the exact drawing instructions—no tracing needed.

🥬 Filling (The Actual Concept):

  • What it is: The key insight is to treat drawing letters as writing text—have a language model type the SVG code that defines each glyph, guided by style words or example images.
  • How it works:
    1. Feed the model either (a) style tags in text, or (b) a few glyph images plus the target character.
    2. The model predicts SVG tokens one by one (commands and numbers) to form the outline.
    3. The result is a valid, watertight SVG path you can edit right away.
  • Why it matters: It unifies text- and image-referenced generation, drops raster steps, and gives you clean vectors in one pass.

🍞 Bottom Bread (Anchor): You type “humanist sans-serif, calm, rounded” and ask for “G.” The model outputs a <path d="..."> that you can paste into any vector editor.

🍞 Top Bread (Hook): You know how some friends understand both what you say and what you show them? That’s a superpower.

🥬 Filling (The Actual Concept):

  • What it is: Multimodal models can read text and look at images, then combine both kinds of clues to make something new.
  • How it works:
    1. Tokenize words to understand style intent.
    2. Encode example images to capture visual features (strokes, serifs, contrast).
    3. Fuse these signals to guide SVG code generation.
  • Why it matters: Designers can start with words only or bootstrap with a few reference glyph images—one model handles both.

🍞 Bottom Bread (Anchor): Show 3 glyphs from a fancy font and say “draw ‘V’ like this.” The model copies the style and makes a matching “V” in vector form.

🍞 Top Bread (Hook): Think of writing a story one word at a time, each word chosen because of the words before it.

🥬 Filling (The Actual Concept):

  • What it is: Autoregressive prediction means the model outputs the next token (like the next letter or number) based on what it has already written.
  • How it works:
    1. Start with the prompt (text tags or encoded images + target character).
    2. Predict the next SVG token (e.g., “M”, a coordinate, “Q”, another coordinate).
    3. Repeat until the full path is complete and closed.
  • Why it matters: This step-by-step recipe is how the model stays consistent over long sequences of coordinates.

🍞 Bottom Bread (Anchor): To draw a “Q” with a tail, the model writes the big oval first (many tokens), then the little tail—always building on what it already put down.
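
The step-by-step loop above can be sketched in Python. Everything here is a stand-in for illustration: `next_token_fn` plays the role of the language model, and the canned token stream replaces real model predictions.

```python
def generate_path(next_token_fn, prompt, max_tokens=100):
    """Append tokens one at a time until the path closes with 'Z'."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        tok = next_token_fn(tokens)   # the "model" sees the full history so far
        tokens.append(tok)
        if tok == "Z":                # path closed: stop decoding
            break
    return tokens

# Stand-in "model": replays a fixed glyph outline for demonstration.
canned = iter(["M", "10", "90", "L", "50", "10", "Z"])
out = generate_path(lambda history: next(canned), ["<style:rounded>", "<char:V>"])
```

The key property the loop captures is that each new token is conditioned on everything already written, which is how the real model keeps long coordinate sequences consistent.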

Multiple analogies for the same idea:

  • Texting a robot pen: You send words like “rounded, playful” and it texts back the exact drawing recipe.
  • Sheet music for letters: The model writes musical notes (tokens) that, when played, sound like the shape of a glyph.
  • GPS directions: The model says “move here, turn there, curve now” until the letter is fully mapped.

Before vs. after:

  • Before: Separate systems for images and vectors, exemplar sheets required, and messy raster-to-vector steps.
  • After: One model covers both text and image conditioning and types clean SVG directly—fast, consistent, and editable.

Why it works (intuition): Letters are code-friendly. An outline is essentially a long string of precise instructions. Language models excel at producing long, structured text. With enough examples and a good training curriculum, they learn both the “grammar” of SVG (commands and coordinates) and the “style semantics” of typography (serifs, contrast, counters).

Building blocks:

  • A tokenizer for text prompts.
  • An image encoder for optional reference glyphs.
  • A decoder-only LLM that outputs SVG tokens.
  • A training recipe that first teaches “how to draw in SVG,” then “how to follow style instructions and copy style from images.”
  • Data preprocessing that normalizes coordinates, canonicalizes paths, and removes bad or duplicate fonts so decoding is stable and clean.

03Methodology

At a high level: Input (style words or reference glyph images + target character) → LLM predicts SVG tokens one by one → Detokenize into a single <path d="..."> for the glyph.

🍞 Top Bread (Hook): Think of writing a shopping list by splitting big ideas into small items you can handle one at a time.

🥬 Filling (The Actual Concept):

  • What it is: Tokenization breaks long text (including SVG commands and numbers) into bite-size pieces called tokens so the model can process and generate them.
  • How it works:
    1. Split text prompts into tokens (words/subwords).
    2. Split SVG path strings into tokens that include command letters, separators, and numeric chunks.
    3. The model learns probabilities over these tokens to write valid SVG.
  • Why it matters: Without tokenization, the model can’t reliably compose long, precise command sequences.

🍞 Bottom Bread (Anchor): The number “-18.1” can be tokenized into pieces the model knows how to spell correctly every time, avoiding typos in coordinates.
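
One simple way to split a path string into command letters and numeric chunks is a regular expression; this is an assumed sketch, and the paper's actual tokenizer may segment differently.

```python
import re

def tokenize_path(d):
    """Split an SVG 'd' string into command letters and signed decimal numbers."""
    return re.findall(r"[A-Za-z]|-?\d+(?:\.\d+)?", d)

tokens = tokenize_path("M 10 90 Q 70.5 50 -18.1 90 Z")
print(tokens)  # ['M', '10', '90', 'Q', '70.5', '50', '-18.1', '90', 'Z']
```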

🍞 Top Bread (Hook): Picture hanging frames on a wall so every picture lines up at the same height and scale.

🥬 Filling (The Actual Concept):

  • What it is: Coordinate normalization rescales and aligns glyphs to a shared coordinate system (like UPM=1000 and a common baseline).
  • How it works:
    1. Rescale all glyphs to a common units-per-em (UPM) size.
    2. Align baselines so letters sit on the same “floor.”
    3. Quantize numbers to one decimal place for stability.
  • Why it matters: If sizes and baselines vary wildly, the model’s long sequences become unstable and errors snowball.

🍞 Bottom Bread (Anchor): The lowercase “g” from many fonts lands at the same baseline after normalization, so the model doesn’t confuse height or tilt with style.
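
The three normalization steps can be sketched as one function. The target UPM of 1000 and one-decimal quantization come from the description above; the function name and argument layout are invented for this example.

```python
def normalize(points, em_size, baseline, target_upm=1000):
    """Rescale points to a shared units-per-em box, shift to a common
    baseline, and quantize coordinates to one decimal place."""
    scale = target_upm / em_size
    return [
        (round(x * scale, 1), round((y - baseline) * scale, 1))
        for x, y in points
    ]

# A glyph drawn in a 2048-UPM font with its baseline at y=100:
pts = normalize([(512, 1100), (0, 100)], em_size=2048, baseline=100)
print(pts)  # [(250.0, 488.3), (0.0, 0.0)]
```

After this, glyphs from very different fonts live in the same coordinate box, so the model's number tokens stay in a stable range.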

🍞 Top Bread (Hook): Imagine a librarian who cleans and organizes a messy library so anyone can find the right book fast.

🥬 Filling (The Actual Concept):

  • What it is: Typography-aware data engineering is the cleaning and organizing of font data so it follows typographic rules and is easy for the model to learn from.
  • How it works:
    1. Remove malformed or overly long paths and deduplicate near-identical fonts.
    2. Canonicalize SVG paths: keep command letters explicit and parse only the d="..." attribute.
    3. Quantize coordinates and split by font families so tests are truly out-of-distribution.
  • Why it matters: Without this prep, the model learns from noisy, inconsistent examples and produces broken outlines.

🍞 Bottom Bread (Anchor): If two fonts are basically clones, one is dropped so the model doesn’t get confused or overfit.
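
The deduplication idea can be illustrated by hashing a coarsely quantized outline; the actual near-duplicate criterion in the pipeline is not specified here, so this keying scheme is an assumption.

```python
def dedupe(fonts):
    """Keep only the first font whose rounded outline hasn't been seen."""
    seen, kept = set(), []
    for name, outline in fonts:
        key = tuple(round(v) for v in outline)  # coarse quantization as the hash key
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept

kept = dedupe([("A", [1.01, 2.0]), ("A-clone", [0.99, 2.02]), ("B", [5.0, 6.0])])
print(kept)  # ['A', 'B']
```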

🍞 Top Bread (Hook): Think of practicing scales on a piano before attempting a concerto—you first master mechanics.

🥬 Filling (The Actual Concept):

  • What it is: Large-scale continuation is the first training stage where the model learns to keep writing correct SVG over long sequences using lots of (noisy) examples.
  • How it works:
    1. Train on 39K Envato fonts with up to ~15 marketing-style tags per font.
    2. Focus on next-token prediction for SVG paths so the model masters syntax and long-horizon geometry.
    3. Use length filtering to avoid pathological, ultra-long paths.
  • Why it matters: Without this foundation, the model can’t reliably close paths, place control points, or maintain counters.

🍞 Bottom Bread (Anchor): After Stage 1, the model is good at spelling SVG like a typist who never misses a bracket or a minus sign.

🍞 Top Bread (Hook): After you learn scales, a coach teaches you how to play with feeling and follow a conductor.

🥬 Filling (The Actual Concept):

  • What it is: Post-training on expert-annotated data is the second stage where the model learns to follow precise style words and copy visual style from images.
  • How it works:
    1. Fine-tune on 2.5K Google Fonts with expert, appearance-focused tags.
    2. Mix two kinds of samples: text-referenced (style words) and image-referenced (1–8 glyph images).
    3. Align language and images to geometry so outputs match requested style and character.
  • Why it matters: Without this step, the model draws, but not necessarily in the right style or matching the references.

🍞 Bottom Bread (Anchor): Say “humanist sans-serif, calm, superellipse,” or show a few glyphs from that font—the model now accurately mirrors that look.

Putting it all together (step-by-step example):

  1. Input: “Font style: rounded sans-serif, playful, humanist; Content: G.”
  2. The text tokenizer encodes the style words; if you provided example glyph images, an image encoder adds visual clues.
  3. The LLM starts emitting tokens: M x y, L x y, Q qx qy x y, ... until Z (close path).
  4. Detokenize to get the final <path d="..."> string—no post-optimizer, no raster denoiser.
  5. Optional: render to a preview image for a quick look.
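
Step 4 above is mechanically simple; a hedged sketch (the real detokenizer may reconstruct whitespace and attributes differently):

```python
def detokenize(tokens):
    """Join model-emitted tokens back into a single <path> element."""
    d = " ".join(tokens)
    return f'<path d="{d}"/>'

path_el = detokenize(["M", "10", "90", "L", "50", "10", "Z"])
print(path_el)  # <path d="M 10 90 L 50 10 Z"/>
```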

Secret sauce:

  • Single objective (next-token prediction) over a unified output space (SVG path tokens).
  • Two-stage training that first nails SVG mechanics at scale, then locks in instruction following and style transfer.
  • Absolute-coordinate serialization that improves geometry at larger model sizes.
  • A strict system prompt to ensure outputs are SVG-only and syntactically valid.
  • Family-level data splits to encourage real generalization across unseen styles.

04Experiments & Results

The test: Researchers checked if generated glyphs were recognizable, cleanly shaped, style-faithful, and distributionally close to real fonts—especially for families the model never saw during training (a hard test).

What they measured and why:

  • Relative OCR Accuracy (R-ACC): Like a reading test—does a strong OCR model read the letter as the intended character? High R-ACC means the glyph is legible and correct.
  • Chamfer Distance (CD): Measures how close the generated outline is to the true outline; lower is more precise geometry.
  • CLIP similarity: Checks if the image of the glyph matches the style words; higher is better stylistic alignment.
  • DINO similarity: Compares deeper visual features; higher suggests closer visual structure to ground truth.
  • FID: Rates how close the distribution of generated glyphs is to real ones; lower means more realistic overall results.
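
Of these, Chamfer Distance is easy to illustrate on outlines sampled as point sets: average the nearest-neighbor distance in both directions. The paper's exact sampling density and normalization may differ from this minimal version.

```python
import math

def chamfer(a, b):
    """Symmetric Chamfer Distance between two 2D point sets."""
    def one_way(src, dst):
        # For each source point, distance to its nearest neighbor in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

d = chamfer([(0, 0), (1, 0)], [(0, 0), (1, 1)])
print(d)  # 1.0
```

Identical outlines give 0; the further the generated curve drifts from the ground truth, the larger the value, which is why lower CD means more precise geometry.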

The competition:

  • Specialized vector-font methods: DeepVecFont-v2 and DualVector.
  • General-purpose LLMs: multiple flagship and budget models.

Scoreboard with context:

  • Text-referenced generation (words only):
    • VecGlypher-27B hits around R-ACC ~100.5, CD ~1.72, DINO ~94.22, FID ~3.46.
    • VecGlypher-70B further trims geometry error (CD ~1.68) and FID (~3.34).
    • Compared to a strong general LLM baseline (e.g., Claude Sonnet 4.5), VecGlypher-70B delivers roughly 2.15× higher recognizability, about 68% lower geometry error, and about 83% lower FID—like going from a B- to an A+ while also writing neater handwriting.
  • Image-referenced generation (with 1–8 example glyph images):
    • VecGlypher-27B reaches R-ACC ~99.12, CD ~1.18, DINO ~95.82, FID ~2.32.
    • Against DeepVecFont-v2 and DualVector, it roughly doubles recognizability and slashes geometry error by ~92% and FID by ~97.8%—that’s like beating the class average by a landslide while keeping every line crisp.

Surprising findings and ablations:

  • Model scale matters: Jumping from ~4B to ~27B parameters greatly improves closures, counters, and thin-stroke fidelity. Bigger models stay on track over long coordinate sequences.
  • Two-stage recipe is crucial: Training first on large, noisy Envato (to learn SVG mechanics) and then on expert-tagged Google Fonts (to align style/appearance) outperforms single-stage mixes.
  • Absolute vs. relative coordinates: Absolute coordinates win at larger scales, yielding sturdier geometry and consistency. At very small scales, relative can slightly help recognizability but tends to hurt geometry.
  • Envato-only vs. Google-only: Envato-only learns to “spell SVG” but lacks precise geometry/stylistic fidelity; Google-only improves fidelity but misses the breadth of long-sequence practice. Together, staged, they shine.

Takeaway: On hard, never-seen font families, VecGlypher writes SVG like a careful, style-savvy typographer—clean, closed, and faithful—while competitors either struggle to produce valid paths or lose style and geometry along the way.

05Discussion & Limitations

Limitations:

  • Current scope covers digits and Latin letters (0–9, a–z, A–Z). Extending to accented characters, scripts like Devanagari, Arabic, or complex cursive requires more data and possibly stroke- or component-based programs.
  • Smaller models are brittle. Stable quality today appears around ~30B parameters unless better tokenization, constrained decoding, or glyph-focused adapters are added.
  • General LLMs trained on icons don’t pick up typographic rules automatically; without glyph-rich data, outputs can be invalid or off-style.
  • Evaluation still relies on proxies (OCR, CLIP, DINO, FID). Human typography judgment (kerning comfort, rhythm, family-wide coherence) is richer and harder to quantify.

Required resources:

  • A capable multimodal LLM backbone (tens of billions of parameters for best results) and multi-GPU training.
  • Large, cleaned corpora of vector glyphs with normalized coordinates and reliable tags or references.
  • A strict prompting and decoding setup to ensure SVG-only outputs.

When NOT to use:

  • If you need precise kerning pairs, ligature sets, hinting, and production-ready typography features out of the box—those remain downstream tasks.
  • If you require scripts far beyond the trained set without suitable training data.
  • If your environment cannot support large models and you cannot accept minor geometry defects from a small model.

Open questions:

  • How to scale to open-ended writing systems efficiently—can component sharing (radicals, diacritics) and trajectory-aware strokes improve generalization?
  • Can lightweight geometry constraints during decoding or learned validators boost closure, winding, and counter stability at smaller scales?
  • What are the best blended rewards (geometry + topology + recognizability + style) for RL fine-tuning and best-of-N sampling?
  • How far can we compress (distill) a strong model to run on-device while keeping vector fidelity and style control?

06Conclusion & Future Work

Three-sentence summary: VecGlypher turns font drawing into text generation by having a multimodal language model write SVG path code for glyphs directly from style words or example images. A typography-aware data pipeline and a two-stage training recipe—first learning SVG mechanics at scale, then aligning to expert style cues—produce clean, watertight, and stylistically faithful vectors. On tough tests with unseen font families, VecGlypher surpasses both specialized vector-font methods and general LLMs, showing that LLMs can be a unified engine for professional typography.

Main achievement: Unifying text- and image-referenced vector glyph generation in a single LLM that emits one-pass, editable SVG paths—no raster intermediates, no post-optimizers—while delivering state-of-the-art geometry and recognizability.

Future directions: Expand to broader scripts with component/stroke programs, add decoding-time geometry constraints, design richer evaluation metrics tied to typographic craft, and explore distillation for smaller, faster models. Interactive tools could combine text prompts, a few exemplars, and real-time edits to co-create full font families.

Why remember this: It’s a shift in interface and engine—design with words (or a few pictures), get true vectors instantly. By treating letter outlines as code a model can write, VecGlypher lowers barriers for learners and pros alike and lays the groundwork for scalable, multimodal design tools that keep the artistry while speeding the craft.

Practical Applications

  • Rapidly prototype custom wordmarks and logos by describing style in plain language.
  • Create classroom alphabets with friendly, readable shapes tailored to young readers.
  • Generate full font sets by starting with a few reference glyph images and expanding automatically.
  • Design themed typography for holidays, events, or games without manual tracing or vectorization.
  • Produce crisp, scalable text for UI/UX mockups directly as editable SVG.
  • Iterate on multiple styles (e.g., rounded vs. slab-serif) in minutes to compare options side by side.
  • Build branding systems where headings and icons share consistent geometric logic.
  • Assist type designers by drafting initial families that can be refined in vector editors.
  • Enable accessible design by quickly testing high-contrast, dyslexia-friendly letterforms.
  • Support multilingual projects by bootstrapping new scripts once component/stroke encodings and data are available.
Tags: VecGlypher · vector glyph generation · SVG path tokens · multimodal language model · text-referenced font design · image-referenced font synthesis · typography-aware data engineering · absolute coordinates · two-stage training · Envato fonts · Google Fonts · Chamfer Distance · Relative OCR Accuracy · watertight outlines · autoregressive decoding