Proxy Compression for Language Modeling | How I Study AI

Proxy Compression for Language Modeling

Intermediate
Lin Zheng, Xinyu Li, Qian Liu et al. Ā· 2/4/2026
arXiv

Key Summary

  • Most language models are trained on compressed tokens, which makes training fast but ties the model to a specific tokenizer.
  • This paper shows a new way, called proxy compression, to train mostly on compressed data but use plain raw bytes at test time.
  • During training, the model sees a mix of compressed sequences and the same data as raw UTF-8 bytes, tagged with simple markers.
  • The model learns to align the two views so knowledge transfers from compressed training to raw-byte inference.
  • At larger sizes, proxy-compressed models match or beat normal tokenizer-based models on code benchmarks while using only raw bytes at inference.
  • Neural and tokenizer-based compressors work well as proxies, but gzip does not transfer effectively because its outputs are unstable.
  • Proxy-trained models keep much of the robustness of byte-level models to formatting and other surface changes.
  • A small warmup phase that pairs compressed and raw versions in the same context helps the model learn to translate between formats.
  • This approach gets the compute efficiency of tokens while keeping the data efficiency and robustness of bytes.
  • No architecture change is needed; it's a training-time data pipeline that can be dropped at inference.

Why This Research Matters

Proxy compression lets us train faster and cheaper without locking models to a fragile tokenizer interface. That means assistants are less likely to break when text has odd spacing, newlines, emojis, or rare words. It also helps support many languages fairly because bytes are universal, avoiding some tokenizer biases. Deployments become simpler: one universal byte interface instead of many tokenizers. As models scale, this approach keeps performance high while improving robustness, making AI tools more reliable in real-world settings.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): Imagine you have a huge library of books, but your backpack is small. You squish the books into summaries so you can carry more at once. That’s like how we train big language models today.

🄬 Filling (The Actual Concept):

  • What it is: Data compression is shrinking information so it takes fewer symbols to store and process, while keeping the meaning.
  • How it works:
    1. Find repeating patterns (like 'the' or common code snippets).
    2. Replace long patterns with shorter labels (tokens or codes).
    3. Save and learn from the short labels instead of the long text.
  • Why it matters: Without compression, sequences are too long and training becomes too slow and expensive.

šŸž Bottom Bread (Anchor): Emoji messages like 'šŸ‘šŸ‘šŸ‘' compress the idea of 'good job' into just a few symbols—faster to send, same meaning.

šŸž Top Bread (Hook): You know how you can guess the next word when your friend starts a sentence? 'Once upon a…' — you think 'time'!

🄬 Filling (The Actual Concept):

  • What it is: Language modeling is teaching a computer to predict the next symbol in a sequence (word, token, or byte).
  • How it works:
    1. Feed in a sequence of symbols.
    2. The model scores possible next symbols.
    3. It picks or samples the most likely next one.
    4. Repeat until the output is done.
  • Why it matters: This simple next-step game lets models write code, answer questions, and follow instructions.

šŸž Bottom Bread (Anchor): When asked 'What is the capital of France is…', a language model predicts 'Paris' because it learned that 'France' is often followed by 'Paris' in this kind of sentence.

šŸž Top Bread (Hook): Think of turning a long book into puzzle pieces that snap together quickly—so you can read faster with fewer pieces.

🄬 Filling (The Actual Concept):

  • What it is: Tokenizer-based compression turns raw text into shorter subword tokens so models can train on fewer, bigger chunks.
  • How it works:
    1. Learn common chunks from lots of text (like 'ing', 'tion', 'print(' ).
    2. Split new text using these chunks.
    3. Process the shorter token sequence.
  • Why it matters: It makes training fast, but locks the model to the tokenizer and can cause weird artifacts and biases.

šŸž Bottom Bread (Anchor): 'internationalization' becomes 'inter' + 'national' + 'ization'—3 tokens instead of 20+ characters.

The world before: Most models trained on tokens because it’s faster. But this glued the whole system to a specific tokenizer. That glue caused well-known problems—like models being sensitive to prompt boundaries, glitchy rare tokens, unfairness across languages, and trouble with adversarial or slightly messy inputs.

The problem: Can we enjoy the speed of compressed training without being stuck with the compressor forever? In other words, can we train fast but still talk in raw bytes at inference time (the universal, end-to-end, no-special-tools interface)?

Failed attempts: Pure bytes keep the universal interface but are slow. Pure tokens are fast but brittle and locked-in. Generic gzip compression is fast to compute but its outputs are unstable and hard for models to learn from. Just swapping compressors doesn’t solve the lock-in.

The gap: We needed a way to use compressed views as a helper during training, but not as a permanent crutch. The model should learn from both compressed and raw forms so it can speak bytes at test time.

The stakes: In daily life, robustness matters. You don’t want your assistant to break because of one extra space in code, a missing accent, or a new language. You also want training to be efficient and affordable. If we can train fast with compressed inputs but deploy simply with raw bytes, we get the best of both worlds: speed and sturdiness.

šŸž Top Bread (Hook): Think of two maps of the same city: a subway map (simple, compressed) and a street map (detailed, raw). If you learn both at once, you can switch easily.

🄬 Filling (The Actual Concept):

  • What it is: Next-symbol prediction (again, but now as a bridge) is the shared game that lets the model align compressed tokens with raw bytes.
  • How it works:
    1. Show the model a mixed stream: sometimes raw bytes, sometimes compressed tokens.
    2. Use special markers to tell which is which.
    3. Always play the same game: predict the next symbol.
  • Why it matters: A single, shared objective lets knowledge flow between the two formats.

šŸž Bottom Bread (Anchor): If the model learns that 'def ' in bytes often matches a certain code-token, it can use that shortcut during training but still write 'def ' in bytes during testing.

This paper’s answer: Proxy compression—a mixed-representation training scheme where compressed sequences act as a training-time proxy. The model sees mostly compressed inputs (for speed) plus some raw bytes (for grounding), learns to align them, then at test time we throw away the compressor and run on bytes only. It keeps robustness, avoids tokenizer lock-in, and can match tokenized training at scale—especially on code.

02Core Idea

šŸž Top Bread (Hook): You know how training wheels help you learn to ride faster and safer—but you don’t keep them forever? You take them off once you’ve learned.

🄬 Filling (The Actual Concept):

  • What it is: Proxy compression is using compressed views as training wheels during learning, then removing them and riding on raw bytes at inference.
  • How it works:
    1. Prepare two views of each sample: raw bytes and compressed.
    2. Mix them during training, mostly compressed for speed, some raw for grounding.
    3. Add simple markers—sentinel tokens—to tell the model which view it’s seeing.
    4. Train one model on the same next-symbol task over both views.
    5. Optional warmup: sometimes show both views of the same sample together so the model can translate between them in context.
    6. At inference, drop the compressor and use raw bytes only.
  • Why it matters: Without this, you must choose: fast but fragile (tokens) or slow but robust (bytes). Proxy compression blends both benefits.

šŸž Bottom Bread (Anchor): Like practicing a song with slow, color-coded notes (proxy) and then performing it on a normal piano (bytes) without the colors.

Multiple analogies:

  • Subway vs. street map: Train with both maps; navigate later with the real streets.
  • Training wheels: Use them to learn faster, then ride freely without them.
  • Bilingual class: Learn vocabulary (compressed) and full sentences (raw) together; later you can speak in full sentences with confidence.

Before vs. after:

  • Before: Pick one—tokenized training (compute-efficient but brittle, locked) or byte training (robust but costly).
  • After: Train mostly on compressed for speed, keep byte grounding, and deploy on bytes only. At larger scales, performance rivals tokenized training while staying robust.

Why it works (intuition):

  • Shared game: Predict-the-next-symbol is the same task for both views, creating a bridge.
  • Alignment pressure: Seeing both views (plus occasional paired examples) nudges the model to map compressed chunks to the underlying bytes.
  • Efficiency + coverage: Compressed inputs mean more examples per unit compute; raw bytes keep the model honest about the real data surface.
  • Scale helps storage: Bigger models can store the alignment better, so transfer improves with size.

Building blocks (all explained with Sandwich below):

  • Tokenizer-based compression: a stable, discrete proxy.
  • Neural proxy compression: a learned, slightly fuzzy proxy that still preserves meaning.
  • Gzip: generic compression that’s unstable as a proxy for learning.
  • Format sentinel tokens: simple markers like <raw> and <comp>.
  • Mixed-representation training: flip a coin per sample; mix views in the same training stream.
  • In-context translation pairing: briefly show compressed+raw of the same data to jump-start alignment.
  • Byte-level inference: at test time, always read and write bytes, no compressor needed.

Now, the required Sandwiches for the new concepts introduced here:

šŸž Top Bread (Hook): Think of using abbreviations in your notes to write faster.

🄬 Filling (The Actual Concept):

  • What it is: Format sentinel tokens are special markers that label which view we’re in (raw or compressed).
  • How it works:
    1. Wrap sequences with tags like <raw> ... </raw> or <comp> ... </comp>.
    2. The model reads these tags and conditions its predictions accordingly.
    3. During packing, multiple sequences can appear in one context with clear boundaries.
  • Why it matters: Without clear labels, the model would confuse bytes and tokens and learn the wrong patterns.

šŸž Bottom Bread (Anchor): Like labeling two notebooks 'Math' and 'Science' so you don’t mix notes.

šŸž Top Bread (Hook): Imagine learning to guess the next letter in both big words and their abbreviations.

🄬 Filling (The Actual Concept):

  • What it is: Mixed-representation training is learning from raw bytes and compressed sequences together.
  • How it works:
    1. For each sample, randomly choose raw or compressed (usually compressed).
    2. Pack them together into training batches, marked by sentinels.
    3. Train the same next-symbol objective over this mix.
  • Why it matters: Without mixing, the model wouldn’t connect compressed hints to the real byte surface.

šŸž Bottom Bread (Anchor): Reading the same story once as a summary and once as the full text helps you connect the two.

šŸž Top Bread (Hook): You know how learning piano makes learning the ukulele a bit easier because both use music ideas?

🄬 Filling (The Actual Concept):

  • What it is: Cross-representation transfer is when knowledge from one view helps you perform in the other.
  • How it works:
    1. Learn patterns from compressed sequences quickly.
    2. Keep seeing enough raw bytes to tie patterns back to the surface.
    3. Reuse those learned patterns when predicting on raw bytes at test time.
  • Why it matters: Without transfer, training on compressed data wouldn’t help byte-level inference.

šŸž Bottom Bread (Anchor): After practicing with flashcards (compressed), you can still write the full essay (raw) better and faster.

03Methodology

At a high level: Raw/Compressed Input → Tag with Sentinels → Mix in One Training Stream → Next-Symbol Prediction → Optional Early Pairing → Discard Compressor → Byte-Level Inference

Step-by-step with Sandwich explanations for each core piece:

  1. Data compression choices (three proxies)

šŸž Top Bread (Hook): Picture three ways to pack a suitcase: by outfits (tokenizer), by a smart robot (neural), or by squishing randomly (gzip).

🄬 Filling (The Actual Concept):

  • What it is: Tokenizer-based compression turns bytes into stable subword tokens; neural compression uses a small byte model plus arithmetic coding; gzip is a generic file compressor.
  • How it works:
    • Tokenizer-based:
      1. Learn frequent chunks.
      2. Split text into those chunks.
      3. Output discrete token IDs.
    • Neural:
      1. Train a small byte-level model to estimate p(byte | history).
      2. Use arithmetic coding to pack bytes into bits based on those probabilities.
      3. Pack bits into fixed-size symbols (e.g., 16-bit) for the LM.
    • Gzip:
      1. Apply standard gzip to bytes.
      2. Treat output bytes as symbols.
  • Why it matters: The proxy you choose shapes what the model learns quickly. Stable, structured proxies transfer well; unstable ones don’t.

šŸž Bottom Bread (Anchor): For code, tokenizer-based and neural packing preserved meaning and patterns, while gzip’s outputs changed wildly with tiny edits and didn’t teach well.

  2. Next-symbol prediction objective (shared bridge)

šŸž Top Bread (Hook): Whether you read the summary or the whole book, you can still guess the next sentence.

🄬 Filling (The Actual Concept):

  • What it is: Next-symbol prediction is the single training game played over both views.
  • How it works:
    1. Feed a mixed stream (raw and compressed) marked by sentinels.
    2. Predict the next symbol—either next byte or next compressed symbol.
    3. Update one shared model.
  • Why it matters: This keeps learning unified so knowledge can move across views.

šŸž Bottom Bread (Anchor): The model might learn from tokens that 'def <name>(' signals a function, and then use that to better predict the bytes 'def ' later.

  3. Format sentinels (clear labels)

šŸž Top Bread (Hook): Like putting colored tabs on the edges of book sections.

🄬 Filling (The Actual Concept):

  • What it is: <raw> ... </raw> and <comp> ... </comp> tokens mark which view is active.
  • How it works:
    1. Wrap each sequence with the right pair of tags.
    2. Train the model to condition on these markers.
    3. Optionally pack multiple sequences per context; tags prevent mixing.
  • Why it matters: Without tags, the model could mash formats together and get confused.

šŸž Bottom Bread (Anchor): With tags, the model knows whether '101' means a byte value or a token ID.

  4. Mixed-representation sampling (efficiency with grounding)

šŸž Top Bread (Hook): Study mostly with flashcards (fast), but sometimes read the full chapter (ground truth).

🄬 Filling (The Actual Concept):

  • What it is: A mixing rate decides what fraction of samples appear compressed vs. raw.
  • How it works:
    1. Flip a biased coin per sample (e.g., 90% compressed, 10% raw).
    2. Pack sequences into fixed-length contexts with document-boundary masking (no cross-doc peeking).
    3. Keep batch size in symbols fixed, so compressed runs cover more unique documents per step.
  • Why it matters: Mostly-compressed training sees more data per FLOP, while the raw 10% keeps the model fluent in bytes.

šŸž Bottom Bread (Anchor): With 90% compressed, the model can see about 3Ɨ more documents than with 100% raw under the same compute.

  5. In-context translation pairing (early booster)

šŸž Top Bread (Hook): When learning a new word, it helps to see it next to its definition once or twice.

🄬 Filling (The Actual Concept):

  • What it is: Briefly show both views of the same sample back-to-back during a warmup.
  • How it works:
    1. Concatenate <raw> raw_bytes </raw> <comp> compressed </comp> (order randomized).
    2. Do this for the first ~10k steps.
    3. Turn it off later to avoid duplicating data and slowing effective compression.
  • Why it matters: This kick-starts the mapping between views, making later transfer stronger.

šŸž Bottom Bread (Anchor): Like seeing a sentence in English and Spanish together for a short period so you can connect them.

  6. Neural proxy details: entropy-based segmentation

šŸž Top Bread (Hook): When cutting a cake, you slice where it’s easiest and makes sense—big frosting changes, edges, or decorations.

🄬 Filling (The Actual Concept):

  • What it is: Entropy-based segmentation splits input where predictability shifts, so many segments can be compressed in parallel.
  • How it works:
    1. Run the small byte model; compute per-byte entropy (surprise).
    2. Mark boundaries at high entropy or big jumps.
    3. Compress segments independently with arithmetic coding (batched), pack bits into 16-bit symbols.
  • Why it matters: Naive arithmetic coding is too slow; segmentation makes neural compression practical at scale and improves learning quality.

šŸž Bottom Bread (Anchor): Code often changes style at newlines or blocks. Segmenting there lines up with meaning and speeds up compression.

  7. Byte-level inference (simple, robust interface)

šŸž Top Bread (Hook): After practicing with special diagrams, you perform the concert on a regular piano.

🄬 Filling (The Actual Concept):

  • What it is: At test time, you only use raw UTF-8 bytes, no compressor needed.
  • How it works:
    1. Feed bytes to the trained model.
    2. Predict next bytes; generate answers or code.
    3. Enjoy robustness to small changes (spaces, newlines, accents).
  • Why it matters: This removes lock-in to any tokenizer and keeps the system end-to-end.

šŸž Bottom Bread (Anchor): A proxy-trained code model can handle extra whitespace or a different function name style without breaking.

Concrete mini example:

  • Raw bytes: <raw> d, e, f, ' ', m, a, i, n, (, ), :, … </raw>
  • Tokenized proxy: <comp> 733, 8223, 1187, … </comp>
  • During warmup, the model sometimes sees both in one context; later it sees mostly <comp> with some <raw>.
  • At inference, we give only bytes; the model still writes correct code in bytes.

Secret sauce:

  • The single next-symbol objective over both views builds an internal dictionary between compressed patterns and raw bytes.
  • Mixing favors speed (more examples per FLOP) without forgetting the real surface.
  • Scale amplifies alignment storage, making transfer stronger with larger models.

04Experiments & Results

The test: Can a model trained mostly on compressed data (with a small portion of raw) perform well when we only let it read and write raw bytes at test time?

Setup:

  • Data: RefineCode (Python subset ~270 GB; full GitHub split ~3.3 TB).
  • Models: EvaByte family at 0.5B, 1.5B, 4B, 7B, 14B parameters.
  • Training budget: Fixed number of sequence symbols per step; proxy mix r ā‰ˆ 0.9 (ā‰ˆ90% compressed, 10% raw). Warmup pairs for ~10k steps.
  • Baselines: Pure byte-level (all raw), pure tokenizer-based (all tokens).
  • Metrics: Downstream code generation (HumanEval, MBPP and their Plus versions), Pass@1; also robustness on ReCode.

Scoreboard with context:

  • On MBPP-Plus, the 14B proxy (tokenizer proxy) hits ~49.5% Pass@1 vs. ~48.1% for tokenizer and ~42.1% for byte-only. That’s like getting an A when tokens get an A- and bytes get a B-.
  • On HumanEval-Plus, at 14B, proxy models reach ~30% Pass@1, matching/rivaling tokenizers (~29–30%) and beating byte-only (~24%).
  • Across sizes, proxy models beat byte-only at fixed compute, and the gap grows with scale. Small models may transfer weakly, but big models align the views much better.

Compute vs. data efficiency:

  • Tokenizers: more compute-efficient (fewer, bigger tokens → more data per FLOP), but less robust.
  • Bytes: more data-efficient under some settings (more optimization per byte), but slower overall.
  • Proxy compression: captures the best of both. Under the same FLOPs, proxy ā‰ˆ tokenizer. Under the same total data, proxy ā‰ˆ byte-level data efficiency while still outperforming tokenizer on tasks.

Longer training horizons (320B symbols on full GitHub):

  • At 7B, proxy matches or exceeds tokenizer on both HumanEval-Plus and MBPP-Plus, while byte-only trails.
  • At 1.5B, proxy > bytes but still < tokenizer—again showing transfer strengthens with scale and time.

In-context translation probe:

  • Show compressed prompt+solution, then the raw prompt, and ask the model to produce the raw solution.
  • Even without explicit pairing, models translate surprisingly well; with always-on pairing, translation becomes near-perfect (>95%).
  • Warmup-only pairing boosts early alignment; later, even when translation accuracy drops, downstream task scores remain strong—showing transfer operates beyond literal sequence copying.

Which proxies are good?

  • Tokenizer-based and neural proxies: good transfer. Their compressed outputs are structured and stable enough that similar inputs map to similar compressed sequences.
  • Gzip proxy: poor transfer. Tiny input changes cause big compressed-output changes; the model can’t learn consistent patterns and performance degrades as gzip share increases.

Neural proxy ā€˜fuzziness’ is structured:

  • Neural compression isn’t perfectly invertible (many raw snippets share a compressed segment), but collisions mostly differ only in trivial suffixes like whitespace or indentation. Over 90% share long common prefixes.
  • This ā€˜fuzzy but focused’ behavior removes formatting noise while keeping meaning, which may explain its strong transfer and robustness.

Robustness (ReCode):

  • Byte-level models are known to be more robust to format changes. Proxy models keep most of this advantage—sometimes even more robust than tokenizer baselines.
  • Biggest gains are on format perturbations (spaces, newlines, indentation): proxy and byte-level stay steady; tokenizer baselines drop sharply.

Surprises and insights:

  • Despite seeing ~90% compressed inputs, proxy models can do byte-only inference as well as or better than token-only inference. Strong transfer plus more compute at test time for longer byte sequences likely help.
  • Cross-document attention isn’t needed; blocking it can even help. Transfer happens mostly through shared weights, not by peeking across packed samples.

05Discussion & Limitations

Limitations:

  • Domain focus: Results are on code; we still need to check natural language, multi-lingual text, logs, or other byte-like modalities.
  • Transfer vs. compression trade-offs: How far can we push compression before transfer quality suffers? The balance among compression rate, stability, and semantics is not fully mapped.
  • Small-model regime: Transfer can be weak or negative at small scales; gains appear clearly as models get larger.
  • Neural proxy decoding: Neural compression is non-invertible by design, so you can’t decode compressed symbols back to raw without the original compressor context.

Required resources:

  • A data preprocessing pipeline to produce compressed views (tokenizer or neural). Neural proxy needs a small compressor model and a parallel arithmetic-coding pipeline (with entropy-based segmentation) to be practical at scale.
  • Standard training infrastructure for LMs (no architecture change), but with sentinels and mixed sampling.

When not to use:

  • If you must do inference directly on tokens (e.g., an existing token-only serving stack) and don’t value byte-level interface, pure tokenization might be simpler.
  • If your only available proxy is gzip-like and unstable for your domain; results suggest unstable proxies hurt transfer.
  • If your models are very small and training budget is tiny; transfer gains may be weak.

Open questions:

  • Generalization: How does proxy compression behave on multilingual text, speech transcripts as bytes, or structured logs?
  • Optimal mixing: What is the best compressed:raw ratio per scale and domain? Can we schedule it dynamically?
  • Better proxies: Can we design compressors that are maximally stable and semantically aligned (perhaps content-aware neural proxies) to further boost transfer and robustness?
  • Architectural synergy: What if we build proxy-awareness into the model (e.g., shared adapters, alignment heads) rather than just in the data pipeline?
  • Long-context and multi-token prediction: How do these interact with proxy training at very long horizons and very large models?

06Conclusion & Future Work

Three-sentence summary:

  • Proxy compression trains one model on a mix of compressed and raw bytes so it can learn fast from short sequences and still speak pure bytes at test time.
  • With simple sentinels and a brief pairing warmup, the model aligns the two views and transfers knowledge from compressed training to byte-level inference.
  • At larger scales, proxy-trained models match or beat tokenized baselines on code tasks and keep much of the robustness of byte-level models—without any architecture changes.

Main achievement:

  • Decoupling training-time compression from inference-time interface: you get token-like efficiency and byte-like robustness using a single, simple training recipe.

Future directions:

  • Explore other domains and languages; design even more semantically stable proxies; learn schedules for mixing ratios; and consider light architectural tweaks to enhance alignment.

Why remember this:

  • It turns a hard either-or (tokens vs. bytes) into a both-and: fast training with compressed views, simple and robust deployment on bytes. That practical blend—speed plus sturdiness—can make future language models cheaper to train, easier to serve, and harder to break.

Practical Applications

  • Train code assistants that handle messy formatting (extra spaces, newlines) without failing.
  • Build multilingual chatbots that operate over bytes to reduce tokenizer bias across languages.
  • Deploy a single raw-byte interface across products to simplify inference infrastructure.
  • Speed up training by using compressed proxies while keeping models robust to input noise.
  • Harden systems against adversarial tokenization attacks by avoiding tokenizer lock-in.
  • Support unusual or mixed encodings (logs, byte streams, file contents) at inference more easily.
  • Improve robustness in IDE integrations where user code formatting varies widely.
  • Enable efficient continual learning where new data arrives in raw bytes without re-tokenization.
  • Prototype domain-specific neural proxies (e.g., for legal or biomedical text) to boost transfer.
  • Combine with long-context strategies to process more documents per unit compute during training.
#proxy compression Ā· #byte-level language modeling Ā· #tokenizer-free inference Ā· #mixed-representation training Ā· #cross-representation transfer Ā· #neural compression Ā· #arithmetic coding Ā· #entropy-based segmentation Ā· #format sentinel tokens Ā· #robustness to formatting Ā· #code generation Ā· #BPE tokenization Ā· #gzip instability Ā· #compute efficiency Ā· #data efficiency