
MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models

Intermediate
Lulu Hu, Wenhu Xiao, Xin Chen et al. · 3/5/2026
arXiv

Key Summary

  • Multimodal AI models handle text, images, and audio, but their signals are very different in size, which breaks standard low‑bit compression methods.
  • Channel‑wise smoothing that works for text‑only models can crush weaker modalities (like audio) when vision dominates, causing big accuracy drops.
  • This paper finds the root cause, called smoothing misalignment: one shared scaling factor fits the strongest modality and misfits the rest.
  • MASQuant fixes this with two parts: Modality‑Aware Smoothing (learn separate scalings per modality) and Cross‑Modal Compensation (tiny low‑rank add‑ons).
  • A clever SVD whitening step makes the between‑modality differences low‑rank, so tiny matrices can fix them without storing multiple weight copies.
  • Results show big wins: on speech recognition with 4‑bit weights and 8‑bit activations, uniform smoothing gives 77.4% WER vs. 3.8% WER for MASQuant.
  • MASQuant keeps one unified quantized weight tensor (fast and memory‑light) and adds very small modality‑specific corrections only when needed.
  • It works on both vision‑language and omni (vision‑audio‑text) models, often matching 16‑bit accuracy at 8‑bit settings and clearly beating prior PTQ methods.
  • The method needs brief calibration data and a short optimization (about 2 epochs) to learn per‑modality scales and tiny low‑rank factors.
  • Bottom line: treat each modality fairly and correct the gaps efficiently, and multimodal models can be quantized aggressively without falling apart.

Why This Research Matters

Phones, headsets, and home devices need small, fast models that can see, listen, and read without sending private data to the cloud. If quantization unfairly favors one modality, assistants will misread signs, mishear speech, or misanswer questions. MASQuant keeps one compact model but treats each modality fairly, so you get accuracy close to full precision without the cost. This helps accessibility tools read text in photos while transcribing speech reliably on‑device. It also benefits education apps, AR glasses, and robots that must run in real time with limited battery and memory. As multimodal AI spreads to everyday gadgets, methods like MASQuant make the difference between a neat demo and a dependable assistant.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how at a party, some people talk very loudly and others speak softly? If you set the music volume to suit the loudest person, the quiet voices disappear.

đŸ„Ź The Concept (Post‑Training Quantization, PTQ): PTQ is a way to shrink a trained model by storing and computing with fewer bits while trying to keep answers the same.

  • How it works:
    1. Take a trained model.
    2. Pick bit widths for weights and activations (like choosing how much to round numbers).
    3. Use a small calibration set to set scales so rounding hurts as little as possible.
  • Why it matters: Without PTQ, big models won’t fit on many devices or run fast enough.
  • Anchor: Like packing a suitcase more tightly so it fits in the trunk without leaving your favorite jacket behind.
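The three steps above can be sketched in a few lines of NumPy. The symmetric 8‑bit scheme and max‑based calibration here are illustrative assumptions, not the exact recipe of any one PTQ method:

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in for activations from a trained model, plus a small calibration set.
calib = rng.normal(size=1000) * 3.0

# Pick a bit width (8-bit here) and set the scale from calibration statistics
# so that rounding hurts as little as possible.
delta = np.abs(calib).max() / 127

def quantize(x, delta):
    """Symmetric fake-quantization: scale, round to integers, clamp, rescale."""
    return np.clip(np.round(x / delta), -128, 127) * delta

# On the calibration data, the rounding error is bounded by half the step size.
err = np.abs(quantize(calib, delta) - calib).max()
assert err <= delta / 2 + 1e-12
```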

🍞 Hook: Imagine a student who can read books, look at photos, and listen to podcasts.

đŸ„Ź The Concept (Multimodal Large Language Models, MLLMs): MLLMs understand and connect text, images, and sounds.

  • How it works:
    1. Each modality (text, vision, audio) becomes tokens (numbers) via encoders.
    2. A shared transformer blends them to answer questions or follow instructions.
    3. The model outputs words, and sometimes other signals.
  • Why it matters: Without MLLMs, AI can’t answer questions like “What does this sign say?” from a picture or “Who is speaking?” from audio.
  • Anchor: Asking, “What brand is in white letters with a red background?” and the model says “Coca‑Cola” because it read the logo in the image and understood your words.

🍞 Hook: Think of a blender that evens out chunks so your smoothie is sippable.

đŸ„Ź The Concept (Channel‑wise Smoothing): Channel‑wise smoothing scales features per channel to spread out spikes before quantization.

  • How it works:
    1. Measure how big each channel’s activations get.
    2. Scale down big channels and scale weights up to keep computation the same.
    3. Quantize; spikes are tamed, rounding hurts less.
  • Why it matters: Without smoothing, a few outlier channels force large ranges and waste bits, making errors bigger.
  • Anchor: If one strawberry chunk is huge, you reduce it so the whole drink has a nice even texture.
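A minimal NumPy sketch of channel‑wise smoothing. The square‑root scale (an α = 0.5‑style choice) is illustrative; the key point is that shifting magnitude from activations into weights leaves the product unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3)) * np.array([50.0, 1.0, 1.0])  # channel 0 is an outlier
W = rng.normal(size=(3, 2))

# Per-channel smoothing scale: shift magnitude from activations into weights.
s = np.abs(X).max(axis=0) ** 0.5   # illustrative alpha = 0.5-style choice
X_smooth = X / s                   # big channels shrink before quantization
W_smooth = W * s[:, None]          # weights absorb the scale

# The product is mathematically unchanged (computational invariance).
assert np.allclose(X @ W, X_smooth @ W_smooth)
```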

🍞 Hook: Picture three flashlights: the vision one is super bright, text is medium, audio is dim. If one dimmer switch controls all three, you set it for the bright one and lose the dim beams.

đŸ„Ź The Concept (Activation Magnitude Disparities): Different modalities naturally produce activations of very different sizes; vision can be 10–100× larger than text or audio.

  • How it works:
    1. Each modality passes through the same layers.
    2. Vision often dominates the max values used to set scales.
    3. Audio/text get over‑shrunk, losing detail.
  • Why it matters: If scaling matches only the strongest modality, others get crushed and quantization errors explode.
  • Anchor: In the paper, when uniform smoothing is used at 4‑bit weights/8‑bit activations, audio WER shoots up to 77.4%—the quiet voice was drowned out.

The world before: Text‑only LLMs could be quantized well with methods like SmoothQuant and AWQ. People assumed these tricks would generalize to multimodal settings.

The problem: One shared smoothing factor per channel is pulled toward the dominant modality (often vision). This smoothing misalignment hurts the weaker ones (often audio) badly.

Failed attempts: Simply re‑tuning a global hyperparameter or balancing losses helps a little but can’t fully protect non‑dominant modalities; learning a single set of factors still leaves them mismatched.

The gap: We need modality‑specific smoothing (fairness) but also one shared quantized weight tensor (efficiency). Storing separate quantized weights per modality would defeat the purpose.

Real stakes: Without a fix, on‑device assistants that see, listen, and read can’t be fast and accurate. That affects accessibility tools (reading signs for the visually impaired), education apps, smart cameras, and voice agents that must run privately and cheaply on edge devices.

02Core Idea

🍞 Hook: You know how you adjust bike seats to fit each rider but still use the same bike frame for everyone?

đŸ„Ź The Concept (The Aha!): Learn a custom “fit” (smoothing) for each modality but keep one shared set of quantized weights, adding only tiny modality‑specific low‑rank corrections when needed.

  • How it works (high level):
    1. Learn separate per‑modality smoothing factors so no one gets over‑ or under‑scaled.
    2. Quantize a single base (text‑smoothed) weight tensor to keep memory low and speed high.
    3. For other modalities, add tiny low‑rank corrections computed with SVD whitening that turns differences into compressible pieces.
  • Why it matters: Without this, you either crush weaker modalities or store multiple weight copies. This idea protects quality and preserves efficiency.
  • Anchor: One frame, different seat heights. The frame is the shared quantized weights; seat height is modality‑aware smoothing; a thin cushion is the low‑rank correction.

Three analogies:

  1. Glasses with different prescriptions: One pair of frames (shared weights), clip‑on lenses per person (low‑rank corrections), and each person’s eye test (modality‑aware smoothing).
  2. Cooking for allergies: One base dish (shared weights), per‑guest finishing touches (low‑rank), and per‑ingredient prep (modality‑aware smoothing) so no one gets sick.
  3. Classroom with stools: Same stool design (shared weights), adjustable legs (smoothing), small pads for comfort (low‑rank) so short and tall students sit well.

Before vs After:

  • Before: Single smoothing fits the loudest modality; weaker ones collapse under quantization. To fix, you’d need multiple weight sets (inefficient) or accept big errors.
  • After: Each modality gets its own smoothing; one shared weight tensor remains; low‑rank add‑ons bridge the last gap efficiently.

🍞 Hook: Imagine sorting a messy pile of socks by color so it’s easier to pack them tightly.

đŸ„Ź The Concept (SVD Whitening): SVD whitening rotates and scales features so their covariance becomes the identity; differences become easier to compress.

  • How it works:
    1. Compute covariance of (already smoothed) activations for a modality.
    2. Do SVD to get a transform T that whitens them.
    3. In this whitened space, the modality gap looks low‑rank, so a tiny SVD truncation captures it well.
  • Why it matters: Without whitening, the difference matrix isn’t reliably low‑rank, and small corrections won’t work.
  • Anchor: Ironing wrinkled fabric before folding makes the stack flatter; whitening makes differences easier to compress.

🍞 Hook: Think of summarizing a big book into a short outline that still keeps the plot.

đŸ„Ź The Concept (Low‑Rank Approximation): Replace a big correction matrix with two skinny matrices whose product captures the most important directions.

  • How it works:
    1. In whitened space, take SVD of the difference.
    2. Keep only the top r singular vectors and values.
    3. Undo whitening to get a small, accurate correction.
  • Why it matters: Without low‑rank, you’d need a full matrix per modality, exploding memory and compute.
  • Anchor: Instead of carrying the whole encyclopedia, carry a few pages of key summaries.
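The "two skinny matrices" idea is easy to see numerically. This toy sketch builds a correction matrix that is secretly rank‑2 and recovers it exactly from a truncated SVD:

```python
import numpy as np

rng = np.random.default_rng(5)
# A "big" 64x64 correction matrix that is secretly rank-2.
M = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64))

U, s, Vt = np.linalg.svd(M)
r = 2
A, B = U[:, :r] * s[:r], Vt[:r]    # two skinny matrices: 64x2 and 2x64

# Their product reproduces M, at a fraction of the storage.
assert np.allclose(M, A @ B)
```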

🍞 Hook: The chef seasons sweet and salty dishes differently, but serves them from the same kitchen.

đŸ„Ź The Concept (Modality‑Aware Smoothing): Learn separate channel scales per modality to avoid one size fitting none.

  • How it works:
    1. Initialize scales from each modality’s activation and weight stats.
    2. Optimize scales directly to minimize reconstruction loss for that modality.
    3. Keep the best scales per modality.
  • Why it matters: Without this, unified scales follow the dominant modality and over‑shrink the rest.
  • Anchor: Vision gets its spice level, audio gets its own, text gets its own—no one’s taste buds get numbed.

🍞 Hook: A universal remote has special buttons for each device, but it’s still one remote.

đŸ„Ź The Concept (Cross‑Modal Compensation): Use the shared text‑smoothed quantized weights for everyone; when a non‑text modality comes, add a tiny low‑rank patch computed via whitening.

  • How it works:
    1. Compute the ideal per‑modality smoothed weight minus the shared weight.
    2. Whiten, do truncated SVD, then unwhiten to get a small correction.
    3. Apply the correction only when that modality appears.
  • Why it matters: Without it, you’d either store many weight copies or lose accuracy.
  • Anchor: One jacket, detachable thin liners for cold days; you add the liner only when needed.

Why it works (intuition):

  • Fair scaling protects each modality’s signal before rounding.
  • Whitening lines up the data axes so modality gaps concentrate in a few directions.
  • Low‑rank keeps only those few directions, giving big accuracy for tiny cost.

Building blocks:

  • PTQ basics (scales, rounding).
  • Computational invariance (re‑arranging math so scaling cancels between activations and weights).
  • SVD whitening (make features independent and equally scaled).
  • Low‑rank SVD truncation (keep the top r directions).
  • Modality‑aware smoothing and small residual patches.

03Methodology

At a high level: Multimodal inputs → Modality‑Aware Smoothing (per modality) → Shared quantized weights (text‑smoothed base) → For non‑text inputs, Cross‑Modal Compensation (whiten → low‑rank patch → unwhiten) → Output.

Step 1. Quantization basics and computational invariance 🍞 Hook: Imagine rounding prices to the nearest dollar but scaling the bill first so rounding hurts less, then un‑scaling later so the total stays the same.

đŸ„Ź The Concept (Quantization operator and invariance): We round numbers using scales and zero‑points, and we can pre‑scale activations and post‑scale weights so the math outcome stays the same.

  • What happens:
    1. Quantize a tensor x to x̂ = Q(x) using a scale Δ and zero‑point z within [q_min, q_max].
    2. For a linear layer Y = XW, we can rewrite it as Y = (XS⁻Âč)(SW) so scaling cancels out.
  • Why this step exists: Without it, spikes force big ranges and rounding errors grow; with it, we tame spikes before rounding.
  • Example with actual data:
    • Quantization: Q(x) = (clamp(⌊x/Δ + z⌉, q_min, q_max) − z) · Δ. Suppose x = 2.3, Δ = 0.5, z = 0, q_min = −8, q_max = 7. Then x/Δ = 4.6, round to 5, clamp to 5, so Q(x) = 5 · 0.5 = 2.5.
    • Invariance: Let X = (2 4) and W = (1, 3)ᔀ, so Y = 2·1 + 4·3 = 14. Choose S = diag(2, 1). Then XS⁻Âč = (1 4) and SW = (2, 3)ᔀ. New product: 1·2 + 4·3 = 14 (unchanged).

🍞 Anchor: We scale before rounding to protect details, then unscale so the answer stays the same.
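Both worked examples above can be checked directly in NumPy; this sketch verifies the Q(2.3) = 2.5 rounding and the XW = (XS⁻Âč)(SW) = 14 invariance:

```python
import numpy as np

def quantize(x, delta, z=0, qmin=-8, qmax=7):
    """Q(x) = (clamp(round(x/delta + z), qmin, qmax) - z) * delta."""
    q = np.clip(np.round(x / delta + z), qmin, qmax)
    return (q - z) * delta

# Worked example from the text: x = 2.3, Delta = 0.5 -> 2.5.
assert quantize(2.3, 0.5) == 2.5

# Computational invariance: Y = XW = (X S^-1)(S W).
X = np.array([[2.0, 4.0]])
W = np.array([[1.0], [3.0]])
S = np.diag([2.0, 1.0])
Y = X @ W                                    # 2*1 + 4*3 = 14
Y_scaled = (X @ np.linalg.inv(S)) @ (S @ W)  # same answer after rescaling
assert np.allclose(Y, 14.0) and np.allclose(Y_scaled, 14.0)
```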

Step 2. Spot the root cause: smoothing misalignment 🍞 Hook: If everyone shares one sweater size, the tallest person decides the size, and smaller kids sink in their sleeves.

đŸ„Ź The Concept (Unified smoothing fails across modalities): One factor per channel is pulled by the biggest modality, over‑shrinking the others.

  • What happens:
    1. Compute s_i from max activations and weights; the largest modality dominates.
    2. Non‑dominant modalities get too much scaling and lose detail after rounding.
  • Why this step matters: If we don’t fix this, audio/text accuracy collapses when vision dominates.
  • Example: Suppose the vision channel range is R_i^v = 100 and audio is R_i^a = 5. A unified scale tracks 100, so audio values are divided by a much larger s_i than they need, becoming tiny and lossy.

🍞 Anchor: One big sweater makes smaller kids cold; unified smoothing makes weaker modalities inaccurate.
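A toy demo of the misalignment, assuming a simple 4‑bit symmetric quantizer and max‑based scales. One shared scale sized for vision rounds the audio channel to zero, while an audio‑specific scale preserves it:

```python
import numpy as np

def quantize(x, delta):
    return np.clip(np.round(x / delta), -8, 7) * delta  # 4-bit symmetric

vision = np.array([100.0, -60.0, 80.0])   # large-magnitude modality
audio  = np.array([5.0, -3.0, 4.0])       # small-magnitude modality

# Unified smoothing: one scale sized for the dominant (vision) range.
s_unified = np.abs(np.concatenate([vision, audio])).max() / 7  # ~14.3
# Modality-aware smoothing: audio gets a scale that fits its own range.
s_audio = np.abs(audio).max() / 7                              # ~0.71

err_unified = np.abs(quantize(audio, s_unified) - audio).mean()
err_aware   = np.abs(quantize(audio, s_audio) - audio).mean()

# The unified scale rounds every audio value to 0; the aware scale does not.
assert err_aware < err_unified
```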

Step 3. Modality‑Aware Smoothing (MAS) 🍞 Hook: You adjust helmet straps differently for biking, skating, and skiing.

đŸ„Ź The Concept (Per‑modality scales): Learn S_m for each modality m so no one gets over‑ or under‑scaled.

  • What happens:
    1. Initialize S_m = diag(s^m) with s_i^m = s · max_t |x^m_{t,i}| / max_j |w_{j,i}|.
    2. Optimize {S_m} to minimize the reconstruction loss ÎŁ_m λ_m · L_MAE(S_m, X_m, W).
    3. Now each modality is fairly scaled for quantization.
  • Why this step exists: Without MAS, the dominant modality dictates scales and hurts the rest.
  • Example with actual data:
    • Initialize: Let max_t |x^text_{t,i}| = 10, max_t |x^vision_{t,i}| = 80, max_j |w_{j,i}| = 20, s = 1. Then s_i^text = 10/20 = 0.5 and s_i^vision = 80/20 = 4.0.
    • Optimize: If λ_text = λ_vision = 1, the learner adjusts S_text and S_vision to reduce ‖Q(X_m S_m⁻Âč) Q(S_m W) − X_m W‖ per modality.

🍞 Anchor: Different sports, different strap settings; different modalities, different smoothing.
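The initialization step can be sketched as follows. This is only the init formula from the text (the learned optimization is omitted), and the convention that rows of W are input channels is an assumption of this toy:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))               # rows = input channels (toy convention)
acts = {                                  # toy calibration activations per modality
    "text":   rng.normal(size=(8, 3)) * 10.0,
    "vision": rng.normal(size=(8, 3)) * 80.0,
    "audio":  rng.normal(size=(8, 3)) * 2.0,
}

def init_scales(X, W, s=1.0):
    """Per-channel init s_i = s * max_t |x_{t,i}| / max_j |w_{j,i}|."""
    return s * np.abs(X).max(axis=0) / np.abs(W).max(axis=1)

S = {m: init_scales(X, W) for m, X in acts.items()}

# Vision's channels see larger activations, so its scales come out larger.
assert S["vision"].mean() > S["audio"].mean()
```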

Step 4. Keep one shared quantized weight 🍞 Hook: One shared family calendar on the fridge keeps everyone in sync.

đŸ„Ź The Concept (Computational invariance with a single base): Store only Q(S_t W) (text‑smoothed). For other modalities, add a tiny fix so everyone can still share the same base.

  • What happens:
    1. Base: quantize once to get Q(S_t W).
    2. For m ≠ t, compute the residual ΔW = S_m W − Q(S_t W), applied in the X_m S_m⁻Âč space.
  • Why this step exists: Without a single base, you need multiple weight copies, wasting memory.
  • Example with actual data: Suppose S_t W = [[2, 0], [0, 1]], Q(S_t W) = [[2, 0], [0, 1]], and S_v W = [[2.2, 0.1], [−0.1, 1.0]]. Then ΔW = [[0.2, 0.1], [−0.1, 0.0]].

🍞 Anchor: One fridge calendar (shared weight), with small sticky notes (corrections) for special events.

Step 5. SVD whitening makes modality gaps low‑rank 🍞 Hook: Straighten spaghetti before measuring it; it packs better.

đŸ„Ź The Concept (Whitening then low‑rank SVD): Whiten X_m S_m⁻Âč so that a transform T makes features balanced; then TΔW becomes low‑rank and is easy to approximate.

  • What happens:
    1. Compute PΛPᔀ = SVD((X_m S_m⁻Âč)ᔀ (X_m S_m⁻Âč)).
    2. Set T = (PΛ^{1/2})ᔀ, so (X_m S_m⁻Âč) T⁻Âč is orthonormal.
    3. Do SVD(TΔW) = UÎŁVᔀ and keep the top r components to get U_r ÎŁ_r V_rᔀ.
    4. Unwhiten: ΔW ≈ L₁L₂ with L₁ = T⁻Âč U_r and L₂ = ÎŁ_r V_rᔀ.
  • Why this step exists: Without whitening, ΔW isn’t reliably compressible; with whitening, a tiny rank works.
  • Examples with actual data:
    • Whitening: Let (X_m S_m⁻Âč)ᔀ (X_m S_m⁻Âč) = [[9, 0], [0, 4]]. Then P = I and Λ = [[9, 0], [0, 4]], so T = (PΛ^{1/2})ᔀ = [[3, 0], [0, 2]].
    • Low‑rank: With TΔW = [[0.6, 0.3], [−0.2, 0]], the top rank‑1 SVD might give U_r = (1, 0)ᔀ, ÎŁ_r = 0.67, V_rᔀ = (0.9 0.44) (illustrative). Then L₁ = T⁻Âč U_r = (1/3, 0)ᔀ and L₂ = 0.67 · (0.9 0.44) = (0.603 0.295), so ΔW ≈ L₁ L₂.

🍞 Anchor: Straight spaghetti (whitening) plus a short bundle (rank‑r) takes little space and still feeds everyone.
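Steps 1–4 can be sketched in NumPy. The rank‑1 residual and noise level are illustrative assumptions; the point is that after whitening, a tiny rank recovers most of the gap:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(64, 8))              # stand-in for smoothed activations X_m S_m^-1
dW = np.outer(rng.normal(size=8), rng.normal(size=6))  # a genuinely rank-1 weight gap
dW += 0.01 * rng.normal(size=(8, 6))      # plus a little noise

# Whitening transform T from the activation covariance (eigendecomposition here).
evals, P = np.linalg.eigh(A.T @ A)
T = (P * np.sqrt(evals)).T                # T = (P Lambda^{1/2})^T
T_inv = np.linalg.inv(T)

# Truncated SVD of the whitened gap, rank r = 1.
U, sig, Vt = np.linalg.svd(T @ dW)
r = 1
L1 = T_inv @ U[:, :r]                      # unwhiten the left factor
L2 = sig[:r, None] * Vt[:r]                # keep the top singular direction

approx = L1 @ L2
rel_err = np.linalg.norm(dW - approx) / np.linalg.norm(dW)
assert rel_err < 0.1                       # a tiny rank recovers most of the gap
```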

Step 6. Final inference rule 🍞 Hook: Default settings for everyone; small add‑ins only when needed.

đŸ„Ź The Concept (Unified compute with optional tiny patches): Use the base Q(S_t W) for text; for other modalities, add X_m S_m⁻Âč L₁L₂.

  • What happens:
    1. For text: Y = Q(X_t S_t⁻Âč) · Q(S_t W).
    2. For m ≠ t: Y = Q(X_m S_m⁻Âč) · Q(S_t W) + X_m S_m⁻Âč · L₁L₂.
  • Why this step exists: Keeps decoding fast and memory small while preserving accuracy across modalities.
  • Example with actual data: If Q(X_m S_m⁻Âč) Q(S_t W) gives (5 2) and X_m S_m⁻Âč L₁L₂ adds (0.4 0.1), the final is (5.4 2.1).
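The inference rule can be sketched end to end. For simplicity this toy applies the full residual ΔW where the real method would use the compressed L₁L₂ patch, and the quantizer and scales are illustrative:

```python
import numpy as np

def quantize(x, delta):
    return np.clip(np.round(x / delta), -2048, 2047) * delta

rng = np.random.default_rng(3)
W = rng.normal(size=(8, 4))
S_t = np.diag(rng.uniform(0.5, 2.0, size=8))    # text smoothing (the base)
S_m = np.diag(rng.uniform(0.5, 2.0, size=8))    # another modality's smoothing
X_m = rng.normal(size=(5, 8))

dq = 0.01                                        # toy quantization step
W_base = quantize(S_t @ W, dq)                   # the single shared quantized weight

# Ideal per-modality weight minus the shared base; the real method compresses
# this residual into L1 @ L2 via whitening + truncated SVD. Used in full here.
dW = S_m @ W - W_base
X_sm = X_m @ np.linalg.inv(S_m)                  # X_m S_m^-1

Y = quantize(X_sm, dq) @ W_base + X_sm @ dW      # base matmul + correction path
Y_ref = X_m @ W                                  # full-precision reference
assert np.abs(Y - Y_ref).max() < 0.5             # correction keeps output near FP
```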

Secret sauce:

  • Fairness first (per‑modality smoothing), then efficiency (one base weight), then precision sprinkles (tiny low‑rank patches made possible by whitening).

04Experiments & Results

The test: The authors measured how well quantized multimodal models perform on tasks spanning images (OCRBench, TextVQA, VizWiz, MMMU, ScienceQA), audio (LibriSpeech, WenetSpeech via WER), and mixed reasoning (OmniBench). They compared various bit settings like W8A8 and aggressive W4A8/W4A6.

The competition: Strong PTQ baselines included SmoothQuant (SQ), AWQ, and MBQ (a modality‑balanced method). A simple round‑to‑nearest (RTN) served as a lower bound.

Scoreboard with context:

  • Vision‑Language (Qwen2.5‑VL‑3B/7B):
    • At W8A8, MASQuant essentially matches FP16. That’s like getting an A when full‑precision gets an A too.
    • At W4A8, where others stumble, MASQuant lifts average accuracy meaningfully over SQ and MBQ. For example, MASQuant improves averages by several points and steadies difficult tasks like MMMU.
  • Omni (vision‑audio‑text, Qwen2.5‑Omni‑3B/7B):
    • Audio is the canary in the coal mine. With uniform smoothing at W4A8, LibriSpeech WER explodes from about 3.9% to 77.4% (from near‑perfect to unusable). MASQuant brings it back near FP16 (about 3.6–3.8%). That’s turning an F into an A in one move.
    • Vision‑text tasks also benefit: MASQuant either matches or beats prior methods across MMMU and OmniBench, indicating stability across modalities.

Surprising findings:

  • The more modalities you mix, the worse unified smoothing behaves; dominance intensifies. Audio often loses worst because its activations are smallest.
  • Whitening dramatically lowers the effective rank of cross‑modal weight differences, so very small ranks suffice. In SQNR terms, whitening lets the low‑rank curve beat a non‑whitened baseline at tiny rank ratios (e.g., 0.08), achieving good signal quality with minimal overhead.

Making numbers meaningful:

  • A collapse from 3.9% WER to 77.4% WER is like going from understanding nearly every word to almost none. MASQuant restoring WER near 3–4% means the model is again reliable for transcription.
  • On vision‑language benchmarks, a few percentage points uplift at low bits is significant because small errors propagate across multi‑step reasoning; MASQuant’s steady gains reflect healthier signals for text and vision jointly.

Ablations and efficiency:

  • Per‑modality smoothing is necessary: removing it causes massive audio failure and lower vision‑text accuracy.
  • Equal loss weights across modalities work best; skewing them can tank hard tasks like MMMU.
  • Training for just 2 epochs during calibration offers a great trade‑off: perplexity and accuracy peak around there.
  • A custom fused CUDA kernel keeps runtime overhead modest: MASQuant is within about 5–10% of MBQ’s prefill latency while giving better quality, and is still far faster and smaller than FP16 (about 2.5–3.3× speedups reported).

Bottom line: Across both dual‑modal and tri‑modal settings and across 4‑ to 8‑bit regimes, MASQuant consistently protects weaker modalities without duplicating weights, and often matches near‑FP16 performance at 8‑bit while staying strong at 4‑bit where others falter.

05Discussion & Limitations

Limitations:

  • Extra calibration compute: Learning separate S_m and doing whitening/SVD adds overhead versus the simplest PTQ. It’s short (a couple of epochs) but not free.
  • Memory vs. rank trade‑off: While low‑rank patches are tiny, very tight memory budgets may require tuning the rank r carefully per layer/modality.
  • Base‑modality choice: Using text as the base simplifies decoding, but if another modality dominates most workloads, one might revisit this choice.
  • Distribution shifts: If future data has very different modality statistics (e.g., much noisier audio), S_m and the low‑rank patches may need refreshing.

Required resources:

  • Small calibration sets per modality (text, vision, audio) to learn S_m and estimate whitening transforms.
  • SVD computations per layer per non‑base modality; practical implementations rely on efficient GPU kernels and batching.

When not to use:

  • Ultra‑tiny models or single‑modality tasks where unified smoothing already works fine and every extra kernel hurts simplicity.
  • Online, on‑device recalibration scenarios with no access to any calibration data at all; MASQuant expects at least a small set.

Open questions:

  • Adaptive ranks: Can we auto‑choose rank per layer from effective rank estimates to minimize latency and memory without losing accuracy?
  • Streaming and long‑context: How do whitening and low‑rank patches behave under streaming inputs or extremely long contexts?
  • More modalities: Can the same idea extend to video, sensor streams, or 3D point clouds with different time/space scales?
  • Robustness: How does modality‑aware smoothing interact with noisy inputs or adversarial perturbations?

06Conclusion & Future Work

Three‑sentence summary: This paper shows that standard channel‑wise smoothing breaks in multimodal models because one shared factor follows the strongest modality and crushes the others. MASQuant fixes this by learning separate per‑modality smoothing while still keeping one shared quantized weight tensor, using SVD‑whitened low‑rank patches as tiny modality‑specific corrections. The result is strong, stable accuracy across text, vision, and audio at low bit widths, with modest overhead and simple deployment.

Main achievement: Proving and exploiting that cross‑modal differences become low‑rank after whitening, enabling a single shared quantized weight plus tiny modality‑specific patches that recover accuracy without duplicating weights.

Future directions: Auto‑tuning ranks per layer, exploring different base modalities, extending to more modalities (video, sensors), and studying robustness to noise and distribution shifts. Engineering‑wise, deeper kernel fusion and sparsity could bring further speed and memory wins.

Why remember this: MASQuant turns a hard either‑or (accuracy vs efficiency) into a both‑and for multimodal quantization: fairness to each modality plus one compact model. It’s a practical recipe for on‑device assistants that can see, listen, and read—fast, small, and still smart.

Practical Applications

  • On‑device visual question answering that remains accurate after compression.
  • Assistive reading of signs, menus, and labels for the visually impaired on smartphones.
  • Reliable on‑device speech transcription in noisy or varied environments after low‑bit quantization.
  • Smart camera analytics (OCR, scene text understanding) with low latency and low power.
  • AR glasses that read and describe surroundings while keeping corrections tiny and fast.
  • Robotics perception stacks that fuse audio, vision, and text commands under tight compute budgets.
  • Classroom tools that analyze diagrams and spoken questions offline, preserving privacy.
  • Call‑center QA assistants that process screenshots and audio snippets locally on laptops.
  • Multimodal note‑taking apps (audio + whiteboard photos) running smoothly on tablets.
  • Edge IoT sensors combining sound and image cues with minimal memory footprints.
#post‑training quantization#multimodal LLM#channel‑wise smoothing#smoothing misalignment#computational invariance#SVD whitening#low‑rank approximation#activation disparities#AWQ#SmoothQuant#MBQ#SQNR#WER#Qwen2.5‑VL#Qwen2.5‑Omni