QP-OneModel: A Unified Generative LLM for Multi-Task Query Understanding in Xiaohongshu Search
Key Summary
- Search engines on social apps used to rely on many separate mini-models that often misunderstood slang and were hard to keep updated.
- QP-OneModel turns five different query-understanding jobs into one smart text generator that outputs a single structured JSON answer.
- It learns in three steps: first from lots of rough data, then from fresh expert labels, and finally from rewards that check each task's correctness.
- A domain-specific backbone (RedOne) helps it speak social-media "slang," closing the gap between general books/web text and SNS language.
- It adds a new signal called an Intent Description: a short, clear sentence about what the user really wants, which boosts rewriting and ranking.
- Offline, it beats strong BERT pipelines by 7.35% overall, with big gains in hard tasks like NER (+9.01% F1) and Term Weighting (+9.31% F1).
- Even the small 0.6B model tops the old pipeline, showing that the method, not just model size, drives the wins.
- On unseen tasks, it outperforms a much larger 32B model by 7.60% accuracy in few-shot Document Intent.
- Online A/B tests show better relevance (the DCG 0/1 error metric, where lower is better, drops by 0.21%) and higher user retention (+0.044%).
- The unified design reduces pipeline errors and maintenance, while the reward-based training helps the model internalize tricky business rules.
Why This Research Matters
On social platforms, users type short, slangy queries that change with trends, so search engines need to understand intent quickly and accurately. By unifying all query-understanding tasks into one model, QP-OneModel reduces pipeline mistakes and maintenance headaches. Its domain grounding helps it “speak” SNS language, so even cryptic queries like product codes get the right results. The new Intent Description gives downstream systems a clear, human-like summary to guide rewriting and ranking. These improvements translate into real gains—better relevance, fewer dead ends, and small but meaningful boosts in user engagement and retention. For everyday users, it simply feels like the app “gets” what they meant.
Detailed Explanation
01Background & Problem Definition
You know how kids in a group project sometimes each do their own part, but when they try to glue everything together at the end, things don’t quite fit? That’s how many search systems on social apps used to work.
Sandwich 1 – Query Processing (QP)
- Top Bread (Hook): Imagine you type “best budget hiking shoes” into a search box and hope the app understands you want shoe recommendations, not a history of hiking.
- Filling (What it is): Query Processing is the set of steps that turns your messy, short query into clean signals a search engine can actually use.
- How it works (steps):
- Spot names and brands (Named Entity Recognition).
- Break the text into words if the language doesn’t have spaces (Word Segmentation).
- Decide which words matter most (Term Weighting).
- Tag the query with categories (Query Taxonomy).
- Optionally explain the user’s goal in a sentence (Intent Description).
- Why it matters: Without QP, the search engine guesses blindly and can return off-target results.
- Bottom Bread (Anchor): For “canmake cream blush 19 asian-style makeup,” QP extracts the brand, product, shade, and style so results match the makeup look you want.
The World Before: In industry, each QP step was usually handled by its own discriminative model (like a BERT tagger or classifier). This “many little boxes in a row” pipeline often struggled with:
- Limited semantic understanding: Colloquial phrases, typos, or emojis common on social apps confused models trained mostly on neat, formal text.
- Error propagation: If early steps made a mistake (like missing a brand), later steps (like term weighting) got worse.
- High maintenance: When business rules or category lists changed, teams had to retrain or retune multiple models, slowing iteration.
- Domain shift: Social media slang evolves lightning fast; yesterday’s labels lag behind today’s trends.
Sandwich 2 – Named Entity Recognition (NER)
- Top Bread: You know how you can spot a friend’s name in a crowded chat?
- Filling (What it is): NER finds important names in a query, like brands, products, or people.
- How it works: 1) Scan the text, 2) mark character spans, 3) assign types (brand/product/etc.), 4) keep only precise matches.
- Why it matters: If you miss the brand or product, the search might show generic results.
- Bottom Bread: In “retro-x jacket,” catching that “retro-x” refers to a Patagonia product line steers the search toward the right items.
The Problem: Even with LLMs getting popular, many attempts still treated each QP task separately or fine-tuned general models that didn’t fully “speak” SNS slang or follow strict business definitions. With limited high-quality, up-to-date labels, standard fine-tuning could overfit to surface patterns and fail on tricky edge cases.
Failed Attempts:
- Isolated generative heads: Generating each task separately missed the natural connections between tasks.
- Plain SFT (supervised fine-tuning): Memorized examples without internalizing the rules; broke on changing slang.
- Generic LLMs without grounding: Performed well on open-domain text but stumbled on short, cryptic SNS queries like codes (“1c1”).
Sandwich 3 – Word Segmentation
- Top Bread: Picture cutting a long bread loaf into slices before making sandwiches.
- Filling (What it is): Splitting a character stream into words when the language lacks spaces.
- How it works: 1) Look for known patterns, 2) test boundaries, 3) split into terms, 4) keep consistent rules.
- Why it matters: If you slice wrong, later steps (like importance) don’t match the real words.
- Bottom Bread: “creamblush19” must become “cream blush | 19” or term weights can’t be assigned correctly.
The Gap: The field needed a single model that:
- Handles all QP tasks together to reduce errors and share context.
- Learns SNS language and business rules progressively, not all at once.
- Produces a human-readable Intent Description to strengthen downstream rewriting and ranking.
Sandwich 4 – Term Weighting
- Top Bread: Imagine packing a suitcase—some items (like your jacket) matter more than others (like stickers).
- Filling (What it is): Assigning each word a level 0–3 for how much it carries the core intent.
- How it works: 1) Segment words, 2) consider entities, 3) score each term’s importance, 4) use scores to guide matching.
- Why it matters: If “blush” and “the” get equal weight, search goes off track.
- Bottom Bread: In “cream blush 19,” “blush” gets 3 (core), “19” maybe 2 (shade), while stop words get 0.
Real Stakes: Better QP means more relevant results, fewer dead ends, and happier users. In real A/B tests on Xiaohongshu, the new approach improved ranking relevance (cutting the DCG 0/1 error metric, where lower is better, by 0.21%) and even nudged next-day user return up (+0.044%). For users, that means finding the right content faster, like getting the exact makeup tutorial you need.
Sandwich 5 – Query Taxonomy
- Top Bread: Think of placing library books on the right shelf so people can find them.
- Filling (What it is): Tagging a query with one or more categories, ordered by importance.
- How it works: 1) Predict multiple labels, 2) rank them, 3) treat Top-1 as the main intent, 4) keep others as helpful context.
- Why it matters: Wrong shelf, wrong results.
- Bottom Bread: “canmake cream blush 19” goes under Beauty/Makeup first, not Cooking or Travel.
02Core Idea
Aha! Key insight in one sentence: Turn every query-understanding job into one structured text-generation task, train it in three progressive stages (broad→precise→rewarded), and add a plain-language Intent Description that supercharges downstream search.
Analogies:
- Swiss Army Knife: Instead of carrying separate tools for NER, segmentation, weighting, and taxonomy, use one sturdy tool with fold-out parts so all steps share the same handle and context.
- One Chef, Full Meal: Rather than four cooks each making a dish without tasting the others, one chef prepares the whole meal, seasoning each step with knowledge of the rest.
- Teacher with Rubrics: A teacher grades math, reading, and science using a single, consistent rubric and improves the class using rewards that reflect each subject's goals.
Before vs After:
- Before: Separate models, fragile pipelines, slow updates, weaker on slang and long-tail edge cases.
- After: One generative model emits a single JSON answer for all tasks, understands SNS language, and learns business rules deeply via multi-reward reinforcement.
Why It Works (Intuition):
- Shared context: Early decisions (like entities) become part of the story the model tells itself while generating later parts (like weights and categories).
- Progressive learning: First soak up broad patterns (even from noisy logs), then tune to fresh expert labels, then refine logic using direct task rewards—like training wheels, then practice, then a timed race.
- Grounding in SNS: A domain backbone (RedOne) helps understand fast-changing slang, emojis, and product codes.
- Extra semantic boost: Intent Descriptions summarize the user’s goal in everyday words, which downstream systems digest easily.
Building Blocks (with Sandwiches):
Sandwich 6 – Unified Generative Framework
- Top Bread: You know how a storyteller can describe a whole scene smoothly, not in chopped-up bits?
- Filling (What it is): A single text-to-text model that generates one structured JSON containing all QP results.
- How it works: 1) Read the prompt with rules and context, 2) generate entities, 3) then segments, 4) then term weights, 5) then categories, 6) then a clear intent sentence.
- Why it matters: If each part is done alone, they can disagree and pass along errors; unified generation keeps them consistent.
- Bottom Bread: One pass outputs {entities, segments, weights, category, intent_desc} that agree with each other.
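To make the unified output concrete, here is a hypothetical example of the single JSON such a model might emit for the running query. The field names and exact schema are illustrative, not the paper's production format.

```python
import json

# Hypothetical unified output for the query
# "canmake cream blush 19 asian-style makeup" (schema is illustrative).
unified_output = {
    "entities": [
        {"span": "canmake", "type": "brand"},
        {"span": "cream blush", "type": "product"},
        {"span": "19", "type": "shade"},
    ],
    "segments": ["canmake", "cream blush", "19", "asian-style", "makeup"],
    # Term weights on the 0-3 importance scale described above.
    "weights": {"canmake": 2, "cream blush": 3, "19": 2, "asian-style": 2, "makeup": 1},
    "taxonomy": ["Beauty/Makeup"],
    "intent_desc": "User wants to use Canmake Cream Blush shade 19 for an Asian-style makeup look.",
}

# One generation pass emits the whole object, so fields agree by construction:
# every weighted term is also a segment, and entities respect segment boundaries.
print(json.dumps(unified_output, ensure_ascii=False, indent=2))
```

Because all fields come from one pass, downstream consumers never see a brand tagged by NER that segmentation then split apart.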
Sandwich 7 – Business-Aware Prompt
- Top Bread: Picture a recipe card with cooking rules plus notes from taste-testers.
- Filling (What it is): A detailed prompt that includes task instructions, company rules, user rewrite history, and example notes from the platform.
- How it works: 1) Load definitions and examples, 2) attach recent user queries, 3) retrieve top related notes, 4) ask for unified JSON.
- Why it matters: Without rules and live context, the model might drift or miss trends.
- Bottom Bread: For “asian-style makeup,” including recent notes and rules helps the model pick the right category and write a precise intent.
Sandwich 8 – Progressive Three-Stage Alignment
- Top Bread: Like learning to bike: start with training wheels, practice in a safe lot, then ride with a coach timing you.
- Filling (What it is): A 3-step training plan—(1) Knowledge Injection, (2) Target Alignment, (3) Multi-Reward RL—to move from broad know-how to precise, rule-following behavior.
- How it works: 1) Mix small expert data with large pseudo-labeled logs, 2) fine-tune only on fresh, clean unified labels, 3) optimize with rewards matching each task’s metric.
- Why it matters: Jumping straight to final performance without foundations leads to brittle behavior.
- Bottom Bread: This path is why the model jumps +9% F1 on NER and Term Weighting vs. the old pipeline.
Sandwich 9 – Reinforcement Learning with Multi-Reward
- Top Bread: Think of earning badges in scouts: camping, first aid, and navigation each have their own score.
- Filling (What it is): Training that improves the model by maximizing a combined reward across tasks (NER/Seg/Weights/Taxonomy), each using its true metric.
- How it works: 1) Generate multiple answers, 2) score each sub-task with its metric, 3) weight scores by business importance, 4) update the policy (GRPO) to prefer higher-scoring outputs.
- Why it matters: Without rewards, the model may memorize examples instead of learning rules.
- Bottom Bread: Rewarded training helps on hard semantics: the 8B model's Term Weighting rises 1.26 points (64.91% to 66.17%) after the RL stage.
Sandwich 10 – Intent Description
- Top Bread: Imagine asking a friend, “What am I really trying to find?” and they answer in one sharp sentence.
- Filling (What it is): A natural-language summary of the user’s goal generated alongside structured fields.
- How it works: 1) Use entities, segments, and weights as context, 2) write one precise sentence capturing the task, 3) feed it to rewriting/ranking as an extra signal.
- Why it matters: Downstream models understand sentences easily and can align their decisions to the stated goal.
- Bottom Bread: “User wants steps for using Canmake Cream Blush 19 to achieve an Asian-style look” improves rewrites and ranking.
03Methodology
At a high level: Input (query + rules + context) → Business-Aware Prompt → Unified Generative LLM → One JSON with all tasks → Downstream search uses both structure and intent sentence.
Step 1: Build the Prompt with Rules and Live Context
- What happens: The system composes a prompt containing (a) task instructions and business rules, (b) the current query, (c) user rewrite history, and (d) top-K candidate notes from the platform.
- Why this exists: Rules ensure consistency; user history clarifies evolving intent; notes ground the model in real SNS content.
- Example: Query: “canmake cream blush 19 asian-style makeup.” History: “how to do asian-style makeup.” Notes: top posts showing liquid/cream blush techniques. Rules: entity types, segmentation patterns, taxonomy list, and how to write intent descriptions.
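The prompt-assembly step above can be sketched as a small helper. The template layout, section headers, and parameter names here are assumptions for illustration, not the production prompt.

```python
def build_prompt(query, rules, history, notes, k=3):
    """Compose a business-aware prompt from rules, session history,
    and top-K retrieved notes (structure is illustrative)."""
    history_block = "\n".join(f"- {q}" for q in history)
    note_block = "\n".join(f"- {n}" for n in notes[:k])
    return (
        f"Task rules:\n{rules}\n\n"
        f"Recent queries in this session:\n{history_block}\n\n"
        f"Top related notes:\n{note_block}\n\n"
        f"Query: {query}\n"
        "Return one JSON with entities, segments, weights, taxonomy, intent_desc."
    )

prompt = build_prompt(
    query="canmake cream blush 19 asian-style makeup",
    rules="Entity types: brand, product, shade. Weight scale: 0 (filler) to 3 (core).",
    history=["how to do asian-style makeup"],
    notes=["Cream blush tutorial for beginners", "Canmake shade guide"],
)
print(prompt)
```

Because the rules live in the prompt rather than in model weights, many business-policy changes become a template edit instead of a retrain.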
Sandwich 11 – User Rewrite History
- Top Bread: When you ask follow-up questions, your friend remembers what you asked before.
- Filling (What it is): A small list of recent queries in the same session.
- How it works: 1) Collect last k queries, 2) add to prompt, 3) let the model use them to resolve ambiguity.
- Why it matters: Without history, the model treats the current query as floating in space.
- Bottom Bread: “asian-style makeup” after “beginner wants to try” suggests tutorials, not product catalogs only.
Sandwich 12 – Candidate Notes
- Top Bread: Checking the bulletin board before answering a tricky question.
- Filling (What it is): A handful of top-matching posts from the platform, retrieved as anchors.
- How it works: 1) Retrieve by similarity, 2) filter to top-K, 3) place snippets in prompt.
- Why it matters: Without anchors, the model may miss fresh slang or emerging items.
- Bottom Bread: Notes showing “avoid powder; use creams” help produce the right intent sentence.
Step 2: Generate the Unified JSON in a Set Order
- What happens: The LLM writes a JSON in the sequence NER → Segmentation → Term Weighting → Taxonomy → Intent Description.
- Why this exists: Upstream results feed downstream steps, maximizing synergy and reducing contradictions.
- Example: Detect “canmake” as brand first, then segmentation respects that boundary, then weights prioritize “cream blush,” then taxonomy picks Beauty, then the intent sentence references shade 19 usage.
Step 3: Stage 1 – Knowledge Injection with Mixed SFT
- What happens: The model trains on a small, clean, human-annotated unified dataset plus huge auxiliary pseudo-labeled data from the old pipeline (task-by-task labels from logs).
- Why this exists: Real unified labels are scarce; pseudo-labels cover wide patterns. Mixing prevents overfitting and teaches breadth.
- Example: Millions of historical queries labeled for NER or segmentation by legacy models, blended with thousands of fresh expert-checked unified samples.
Sandwich 13 – Knowledge Injection
- Top Bread: Learning the map of a city from both a trusted guidebook and lots of crowd tips.
- Filling (What it is): Combining small gold data with large noisy pseudo-labels.
- How it works: 1) Sample unified gold, 2) sample task-specific pseudo, 3) jointly train with a balance weight, 4) learn broad coverage but keep correct schema.
- Why it matters: Only gold = too little breadth; only pseudo = too noisy.
- Bottom Bread: The model learns obscure product codes like “1c1” while staying aligned to business rules.
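The gold/pseudo mixing described above can be sketched as a simple batch sampler. The `gold_ratio` balance weight is an assumed knob for illustration, not a value from the paper.

```python
import random

def sample_mixed_batch(gold, pseudo, batch_size, gold_ratio=0.3, seed=0):
    """Draw a training batch mixing scarce gold unified labels with
    abundant pseudo-labels from the legacy pipeline (a minimal sketch;
    gold_ratio is an assumed balance weight)."""
    rng = random.Random(seed)
    n_gold = max(1, int(batch_size * gold_ratio))
    batch = [rng.choice(gold) for _ in range(n_gold)]
    batch += [rng.choice(pseudo) for _ in range(batch_size - n_gold)]
    rng.shuffle(batch)  # interleave so every batch sees both sources
    return batch

gold_data = [("gold", i) for i in range(5)]        # small expert set
pseudo_data = [("pseudo", i) for i in range(100)]  # large legacy-labeled logs
batch = sample_mixed_batch(gold_data, pseudo_data, batch_size=10)
```

Keeping some gold in every batch anchors the output schema, while the pseudo-labeled bulk supplies breadth over rare queries and product codes.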
Step 4: Stage 2 – Target Distribution Alignment (Clean SFT)
- What happens: Fine-tune only on recent, human-annotated unified data to match the live platform distribution.
- Why this exists: Language and products change; this step corrects drift from older logs and locks in strict schema consistency.
- Example: Weekly refreshed labels capture new slang for makeup trends, ensuring correct taxonomy boundaries.
Step 5: Stage 3 – Multi-Reward RL (GRPO)
- What happens: The model generates multiple outputs; each is scored by real task metrics (F1 for NER/Seg; joint F1 for Term Weighting; a combined taxonomy score). The policy is updated to favor better-scoring outputs, with a KL-safety term to stay near the reference model.
- Why this exists: SFT can memorize; RL with verifiable rewards teaches rule-following and deeper logic.
- Example: Outputs that perfectly match entity spans and correct term-weight tuples earn higher returns; the model learns to avoid partial-span mistakes.
Sandwich 14 – GRPO (Group Relative Policy Optimization)
- Top Bread: Racing with a group and improving by beating your own batch best, not just an old record.
- Filling (What it is): A policy-gradient method that compares a group of samples and pushes the model toward higher-reward ones while controlling drift.
- How it works: 1) Generate G candidates, 2) compute per-task rewards, 3) combine with business weights, 4) update with group-relative advantages and a KL penalty.
- Why it matters: Stabilizes learning and focuses on what truly improves metrics.
- Bottom Bread: The 8B model’s Term Weighting rises from 64.91% to 66.17% after RL.
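A minimal sketch of the group-relative scoring at the heart of GRPO, assuming illustrative task weights and reward values; the actual policy update and KL penalty are omitted.

```python
import statistics

def combined_reward(scores, weights):
    """Combine per-task scores (e.g. NER/Seg/Weights/Taxonomy metrics)
    using business-importance weights (weights here are assumptions)."""
    return sum(scores[task] * w for task, w in weights.items())

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each candidate's reward against
    its own group's mean and std, so the model is pushed toward the
    better outputs within the sampled batch."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Score a group of G=4 sampled outputs on two tasks (numbers illustrative).
task_weights = {"ner": 0.4, "term_weighting": 0.6}
group_scores = [
    {"ner": 0.9, "term_weighting": 0.7},
    {"ner": 0.8, "term_weighting": 0.9},
    {"ner": 0.5, "term_weighting": 0.4},  # weakest candidate
    {"ner": 0.9, "term_weighting": 0.9},  # strongest candidate
]
rewards = [combined_reward(s, task_weights) for s in group_scores]
advantages = group_relative_advantages(rewards)
```

Because advantages are relative within the group, a candidate is reinforced only when it beats its siblings, which stabilizes learning without needing an absolute value baseline.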
Step 6: Smart Data Sampling for RL
- What happens: The team picks samples where one target task varies in reward but others are stable, so the model knows exactly which skill to improve.
- Why this exists: If rewards are all mixed together, the model can’t tell which task needs help.
- Example: Choose queries where taxonomy is the main confusion while NER/Seg/Weights are steady.
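This sampling idea can be sketched as a reward-variance filter over candidate generations; the variance threshold and data layout are assumptions for illustration.

```python
import statistics

def select_rl_samples(samples, target_task, var_threshold=0.01):
    """Keep queries whose candidate rewards vary on the target task but
    are stable on the others, so the gradient signal isolates one skill
    (a minimal sketch; threshold is an assumption)."""
    selected = []
    for s in samples:
        # s["rewards"] maps task name -> per-candidate reward list.
        variances = {t: statistics.pvariance(v) for t, v in s["rewards"].items()}
        target_varies = variances[target_task] > var_threshold
        others_stable = all(
            v <= var_threshold for t, v in variances.items() if t != target_task
        )
        if target_varies and others_stable:
            selected.append(s["query"])
    return selected

samples = [
    # Taxonomy is the confusion point; NER is steady -> good RL sample.
    {"query": "q1", "rewards": {"taxonomy": [0.2, 0.9], "ner": [0.8, 0.8]}},
    # Both tasks vary -> signal is entangled, so it is filtered out.
    {"query": "q2", "rewards": {"taxonomy": [0.2, 0.9], "ner": [0.1, 0.9]}},
]
chosen = select_rl_samples(samples, target_task="taxonomy")
```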
Step 7: Deployment with Nearline Inference and KV-Cache
- What happens: QP-OneModel precomputes outputs daily and stores them. Online servers fetch cached signals fast; when a cache miss happens, the system falls back to nearline inference.
- Why this exists: Ensures low-latency and high-throughput for a busy SNS search.
- Example: Ranking uses NER, Taxonomy, and Term Weights as features; rewriting consumes the Intent Description as context.
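The serving pattern above can be sketched as a cache with a nearline fallback; class and method names here are illustrative, not the production system.

```python
class QPSignalService:
    """Serve QP signals from a precomputed daily cache, falling back to
    slower nearline inference on a miss (a minimal sketch)."""

    def __init__(self, cache, nearline_infer):
        self.cache = cache                    # query -> unified JSON dict
        self.nearline_infer = nearline_infer  # slower model call on a miss
        self.misses = 0

    def get_signals(self, query):
        if query in self.cache:
            return self.cache[query]          # fast path for head queries
        self.misses += 1
        result = self.nearline_infer(query)   # cache miss: run the model
        self.cache[query] = result            # backfill for next time
        return result

svc = QPSignalService(
    cache={"cream blush": {"taxonomy": ["Beauty/Makeup"]}},
    nearline_infer=lambda q: {"taxonomy": ["Unknown"]},
)
hit = svc.get_signals("cream blush")        # served from cache
miss = svc.get_signals("new slang query")   # triggers nearline inference
```

Head queries repeat heavily in search traffic, so a daily precompute covers most requests and the nearline path only pays LLM latency for the long tail.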
Secret Sauce
- Unified generation enforces consistency and lets tasks help each other.
- Domain grounding (RedOne) makes the model fluent in SNS slang.
- Multi-reward RL aligns training directly with business metrics.
- Intent Description adds a clear, human-readable semantic bridge to downstream models.
04Experiments & Results
The Test: A Golden Test Set of ~2,500 fresh production queries was hand-labeled by trained annotators following expert-written protocols, with cross-checks to ensure quality. Metrics:
- NER & Segmentation: strict F1 (exact span/type and exact term boundaries).
- Term Weighting: joint F1 over exact (term, weight) pairs—no credit if the term is wrong even if the weight is right.
- Taxonomy: average of Top-1 accuracy and F1 (to balance main intent and full coverage).
- Overall: average of all sub-task scores.
Sandwich 15 – Task Metrics
- Top Bread: Like grading a project on neatness, accuracy, and completeness.
- Filling (What it is): Specific rules that score each sub-task fairly and strictly.
- How it works: 1) Define per-task metrics, 2) compute exact matches, 3) average appropriately, 4) compare across models.
- Why it matters: Without fair scoring, we can’t tell real progress from lucky guesses.
- Bottom Bread: The joint F1 for Term Weighting prevents cheating by forcing correct segmentation before scoring weights.
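The strict joint F1 for Term Weighting can be sketched directly: a predicted pair only scores when both the term boundary and its weight match the gold pair exactly.

```python
def joint_f1(pred_pairs, gold_pairs):
    """Joint F1 over exact (term, weight) pairs: a wrong weight or a wrong
    segmentation boundary both cost the pair (a sketch of the strict metric)."""
    pred, gold = set(pred_pairs), set(gold_pairs)
    tp = len(pred & gold)                 # exact-match true positives
    if tp == 0:
        return 0.0                        # also avoids division by zero
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [("cream blush", 3), ("19", 2), ("canmake", 2)]
pred = [("cream blush", 3), ("19", 1), ("canmake", 2)]  # wrong weight on "19"
score = joint_f1(pred, gold)  # only 2 of 3 pairs match exactly
```

Scoring the pair rather than the weight alone is what forces correct segmentation first, as the sandwich above notes.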
The Competition:
- Strong BERT-based pipeline (industry baseline).
- Task-isolated generative training.
- Alternative backbones: Qwen3 vs. RedOne.
- Bigger general LLMs (Qwen3-32B) and domain LLMs (RedOne) for unseen tasks.
Scoreboard with Context:
- Main offline results: QP-OneModel-8B lifts overall by +7.35% vs. BERT pipeline. Hard tasks jump big: NER F1 +9.01%, Term Weighting F1 +9.31%. Even QP-OneModel-0.6B beats the pipeline across the board, with Term Weighting +8.59% (56.86% → 65.45%). That’s like moving from a B- to a solid A in the toughest subjects.
- Unified vs. Task-Isolated: Under identical settings (Stage 1), the unified model wins overall (79.36% vs. 78.11%), especially on synergistic tasks (Seg and Weights). This shows the benefit of letting tasks inform each other.
- Backbone ablation: RedOne (SNS-adapted) edges out Qwen3 in Stage 1, especially on segmentation and taxonomy, proving domain grounding matters.
- Training stages: Stage 2 (fresh unified SFT) improves distribution alignment; Stage 3 (RL) adds notable gains on semantic-heavy tasks—8B Term Weighting climbs from 64.91% to 66.17% and NER from 83.62% to 83.86%.
Surprising Findings:
- Small but mighty: The 0.6B model still tops the old pipeline, showing the approach—not just parameter count—drives success.
- Generalization: On unseen Document Intent in few-shot ICL, QP-OneModel-8B hits 82.40% accuracy, beating a 32B model by 7.60%. On Authority Intent, it’s competitive with much larger models. This suggests the model learned a deep “meta-understanding” of queries.
Online A/B Tests (Real Users):
- Fundamental signals only: Replacing the old signals with QP-OneModel's reduces DCG 0/1 by 0.21% (lower is better; this exceeds the 0.15% significance threshold) and cuts the Zero/Few-Result Rate by 0.4631%.
- Intent Description in rewriting: Adding the sentence as training supervision and runtime context increases Note Effective CTR by +0.17% and SAU next-day retention by +0.044%.
Takeaway: Offline wins translate to online impact—better relevance and small-but-significant boosts in engagement and retention at platform scale.
05Discussion & Limitations
Limitations
- Annotation hunger: High-quality, up-to-date unified labels are essential for Stage 2 and RL scoring. Scaling expert labeling is costly and time-sensitive.
- Rule maintenance: Business rules in prompts need curation. If rules lag behind trends, the model may drift or produce schema-mismatched JSON.
- Latency/throughput: Although nearline caching helps, on-demand generation for cache misses still costs compute; very tight latency budgets may require further optimization or distillation.
- Pseudo-label noise: Stage 1 learns from legacy outputs that can be imperfect; careful mixing and weighting are necessary to avoid baking in old errors.
- Intent sentence faithfulness: While helpful, Intent Descriptions must avoid hallucinations; guardrails and validation are needed.
Required Resources
- Domain-adapted backbone (e.g., RedOne) and infrastructure for retrieval of candidate notes.
- A labeling pipeline with expert-written protocols and trained annotators.
- RL training stack (GRPO/PPO-style) with verifiable per-task reward evaluators.
- Nearline inference and KV-caching to meet production latency.
When NOT to Use
- Very low-resource settings lacking any expert labels or reliable legacy signals.
- Environments with rigid, unchanging taxonomies where a tiny discriminative model suffices and simplicity beats flexibility.
- Domains without access to domain-adapted pretraining or where privacy constraints prevent using historical logs, limiting Stage 1.
Open Questions
- Can we shrink the need for human labels using self-training, confidence filtering, or synthetic data?
- How to further reduce latency—through model distillation, caching strategies, or modular decoding?
- Can the approach extend to multi-lingual SNS with code-switching and emojis more robustly?
- How to ensure intent descriptions remain faithful and auditable—e.g., with automatic consistency checks?
- How stable is performance under rapid slang shifts, and can online learning update rules and prompts safely?
06Conclusion & Future Work
Three-Sentence Summary
- QP-OneModel turns five query-understanding tasks into one unified text generator, guided by business-aware prompts and trained with a three-stage plan: knowledge injection, target alignment, and multi-reward RL.
- It beats strong baselines offline (+7.35% overall; big F1 jumps in NER and Term Weighting), generalizes to unseen tasks better than a larger 32B model, and improves online relevance and retention.
- Its novel Intent Description provides a high-fidelity semantic bridge that boosts rewriting and ranking in real production.
Main Achievement
- Showing that a unified, reward-aligned generative model—with SNS-domain grounding—can replace fragile QP pipelines while improving accuracy, adaptability, and downstream impact.
Future Directions
- Distill the model for lower latency; automate rule updates; expand to more languages; add guardrails for faithful intent descriptions; explore continual RL updates with live, verifiable rewards.
Why Remember This
- It’s a practical blueprint for moving from brittle pipelines to coherent, end-to-end generative systems that both understand user intent deeply and speak the fast-changing language of social media—turning better semantic signals into real business wins.
Practical Applications
- Power better query rewriting by feeding the Intent Description as context to produce clearer, goal-aligned rewrites.
- Improve ranking features by using structured outputs (entities, term weights, taxonomy) to match documents more precisely.
- Reduce zero-result queries by grounding ambiguous inputs with candidate notes and domain-aware segmentation.
- Speed up operations by updating business rules in prompts without retraining, enabling rapid policy changes.
- Boost cold-start coverage for emerging slang and products using Stage 1 knowledge injection from historical logs.
- Guide vertical routing (e.g., beauty, fashion, travel) by relying on accurate Top-1 taxonomy labels.
- Trigger special retrieval strategies (e.g., document-first or authority-first) using signals learned in generalization tests.
- Audit and debug search behavior by reading the Intent Description to see if the model understood the user's goal.
- Distill the unified model into smaller, faster variants for on-device or edge use while keeping core accuracy.
- Expand to multilingual search by extending prompts and stages with language-specific rules and labels.