How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities
Key Summary
- Large language models can act unpredictably in sensitive places like schools, hospitals, and customer support, so we need reliable ways to guide how they talk and behave.
- This paper introduces SteerEval, a new benchmark that checks how well we can control model behavior at three levels: what to say (L1), how to say it (L2), and the exact words or symbols to use (L3).
- SteerEval tests control across three everyday domains: language features (like being brief or repetitive), sentiment (happy vs. neutral), and personality (autonomous vs. deferential).
- The team built 7,560 carefully crafted, paired examples using an automated pipeline plus human checks, so results are trustworthy and repeatable.
- Prompting (giving the model clear instructions or examples) works strongly and stays stable across all levels; activation-based methods (nudging hidden neurons) often drop in performance at the finest level (L3).
- Fine-grained control (L3) is the hardest: even when models try very hard to follow rules, they can still miss tiny, checkable details like a required keyword.
- More few-shot examples help at coarser levels (L1/L2), but can hurt at L3 because extra examples add distracting surface patterns.
- Turning up activation-steering strength improves hitting the concept but can break instruction-following and fluency, so there's a trade-off and a sweet spot.
- SteerEval links high-level goals to checkable text, giving researchers a clear map of where current control methods work and where they fail.
- This matters in real life because better control means kinder chatbots, safer advice, consistent tone for brands, and fewer surprises from AI.
Why This Research Matters
When AI helps people, tone and style matter as much as facts. A health assistant should be calm and clear, not panicky; a study buddy should be supportive, not dismissive; a brand chatbot should sound on-message every time. SteerEval shows whether our control methods can deliver that consistency from big goals (be kind) down to tiny details (include a specific warning). By revealing where control breaks, especially at fine, checkable details, teams can fix issues before deployment. This means fewer surprises, safer advice, and interactions that feel more human and trustworthy. In short, better control makes AI more helpful and reliable in everyday life.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're the director of a school play. Your actors (the AI models) are amazing at improv, but sometimes they forget the script, switch moods randomly, or deliver lines in a totally different style than you wanted. That's exciting for creativity but risky when the play needs to fit a plan.
The Concept (Controllability of LLMs): Controllability means getting a language model to reliably act the way you intend, not just sometimes but almost every time.
- What it is: A model's ability to follow desired content and style constraints predictably.
- How it works: You define the target behavior, send a steering signal (like a prompt or an internal nudge), and check if the output matches the plan without breaking quality.
- Why it matters: Without control, models can sound off-topic, shift tone, or miss crucial details: bad news for tutoring, medical info, or customer support.
Anchor: Asking a tutor-bot to "explain fractions gently to a worried student" should yield a calm, encouraging reply, not a harsh or robotic one.
Hook: You know how Lego builds start with a big idea (castle), then a plan (walls and towers), then exact bricks (2x4, 1x2)?
The Concept (Behavioral Granularity): Behavioral granularity means steering AI at three layers: the big idea, the plan, and the tiny details.
- What it is: A layered way to specify behavior, from what to say (L1), to how to say it (L2), to specific, checkable markers (L3).
- How it works:
- L1: Set the high-level direction (e.g., be autonomous).
- L2: Pick the strategy (e.g., make self-driven choices).
- L3: Enforce exact surface cues (e.g., must include the word "self-authored").
- Why it matters: Success at the big idea doesn't guarantee the tiny details show up; without layers, we can't diagnose where control fails.
Anchor: "Be enthusiastic" (L1) → "Use a celebratory tone" (L2) → "Include 'hooray'" (L3).
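What makes L3 special is that it is mechanically verifiable: an exact surface cue can be checked by string matching rather than by judgment. A minimal sketch of such a check (the function name and whole-word matching rule are our own assumptions, not the paper's implementation):

```python
import re

def check_l3_marker(text: str, marker: str, required_count: int = 1) -> bool:
    """Verify that an exact L3 surface cue appears the required number of times."""
    # Whole-word, case-insensitive matching accepts "Hooray!" but rejects "hoorays".
    hits = re.findall(rf"\b{re.escape(marker)}\b", text, flags=re.IGNORECASE)
    return len(hits) == required_count

print(check_l3_marker("Hooray! We finished the project.", "hooray"))    # True
print(check_l3_marker("We are thrilled about the project.", "hooray"))  # False
```

An L1 goal like "be enthusiastic" has no comparable yes/no test, which is exactly why failures concentrate at the level where they can be counted.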
The World Before: LLMs were great at many tasks but often unpredictable in style, tone, and persona. Benchmarks existed for narrow skills (like safety or sentiment), but they weren't organized by behavior layers. This made cross-method comparison messy: was a failure due to the method, or because the target was too fine-grained?
The Problem: We lacked a single, principled way to test whether control survives as instructions become more precise, from general intent to exact words.
Failed Attempts:
- One-off tests (e.g., just sentiment) didn't generalize across behaviors.
- Feature-based datasets (like ones derived from sparse autoencoders) were interpretable in parts but not aligned to everyday, named behaviors people care about.
- Evaluations often used prompts not tailored to the target concept, making results hard to compare.
The Gap: A unified, hierarchical benchmark linking high-level goals to checkable text across multiple domains was missing.
Real Stakes:
- Customer support: consistent tone, no mood swings.
- Education: age-appropriate explanations that stay supportive.
- Healthcare info: calm, clear phrasing that avoids panic.
- Workplace tools: predictable style that matches brand voice.
- Safety: reduce surprises when requests get tricky.
Anchor: Think of SteerEval like a driving test that checks highway driving (L1), city rules (L2), and parallel parking to exact inches (L3). Passing one doesn't guarantee passing the others, so you test all three.
02 Core Idea
Hook: Picture a treasure map that doesn't just say "Go north," but also tells you which path to take and which pebbles to count along the way to prove you stayed on track.
The Concept (SteerEval): SteerEval is a benchmark that checks how well we can guide AI behavior from big intentions to tiny, verifiable details across three domains: language features, sentiment, and personality.
- What it is: A hierarchical evaluation that connects what to express (L1), how to express it (L2), and how to instantiate it (L3).
- How it works: It generates concept hierarchies, tailored questions, and contrastive answer pairs, then scores model outputs on concept match, instruction following, and fluency, plus a harmonic-mean summary.
- Why it matters: It reveals where control breaks as we get more precise, guiding better methods and safer AI.
Anchor: If you ask for "be redundant (L1), by rephrasing (L2), and include 'i.e.,' (L3)," SteerEval checks each step, not just the vibe.
The Aha Moment in one sentence: Control must be tested as a staircase (intent, strategy, evidence) because passing the first steps doesn't prove you can climb the last.
Multiple Analogies:
- Cooking: Decide the dish (L1), choose a technique like sauté vs. roast (L2), then require a specific spice you can taste (L3).
- Music: Pick the genre (L1), choose the arrangement style (L2), then include a signature riff everyone recognizes (L3).
- Sports playbook: Call a running play (L1), pick an inside zone scheme (L2), then require the snap count "on two" (L3).
Before vs After:
- Before: Mixed evaluations; success at "tone" didn't say much about hitting exact phrases; methods looked better or worse depending on the test.
- After: A clear, apples-to-apples framework shows prompting is robust across levels, while activation steering struggles at L3, pinpointing where and why.
Why It Works (intuition, no equations):
- Human communication is layered; models likely store similar layers internally.
- Coarse goals are easier because many wordings fit.
- Strategies narrow options; evidence-level constraints demand rare tokens or patterns, which are easy to miss.
- A harmonic mean punishes any single weak spot, mirroring real-world needs (you want the right content, followed clearly, written well).
Building Blocks (with mini Sandwich explainers):
- Hook: You know how lessons cover different subjects? The Concept (Behavioral Domains): Personality, Sentiment, Language Features.
- What: Three everyday areas where style and tone matter.
- How: Each gets L1-L3 tasks.
- Why: Different domains stress different control skills. Anchor: "Sound autonomous" (personality), "be enthusiastic" (sentiment), "add restatements" (language features).
- Hook: Like zooming from a city view to street view to a house number. The Concept (Levels L1-L3): What, how, instantiate.
- What: L1 intent; L2 expression strategy; L3 exact markers.
- How: Progressive constraints tighten freedom.
- Why: Lets us see exactly where control falls apart. Anchor: "Be positive" → "celebrate wins" → "include 'hooray'."
- Hook: Giving instructions vs. secretly nudging the engine. The Concept (Prompt-based vs Activation-based Steering): Two ways to steer.
- What: Prompts change inputs; activation methods nudge hidden states.
- How: Prompts give examples or rules; activations add vectors or directions.
- Why: They can succeed at different layers; comparing them shows trade-offs. Anchor: A pep talk before the game (prompt) vs. adjusting muscle memory (activation).
03 Methodology
At a high level: Input (domain + desired size) → Step A: Build concept hierarchy (L1→L2→L3) → Step B: Create and refine questions → Step C: Generate paired answers (match vs not-match) → Step D: Quality checks (auto + human) → Step E: Evaluate models and steering methods.
Step A: Hierarchical Concept Synthesis
- What happens: The system picks a domain (e.g., Sentiment), writes a precise domain description, and generates 3-layer concepts: L1 (intent), L2 (strategy), L3 (atomic markers).
- Why this step exists: Without a clean, separated hierarchy, we can't tell whether failures come from the idea, the strategy, or the final evidence.
- Example: Domain = Language Features; L1 "Increase redundancy," L2 "Immediate paraphrase," L3 "Must include 'i.e.,'."
Step B: Question Generation and Refinement
- What happens: For each concept, the system creates many questions plus an anchor example. Then it rewrites questions to pivot toward a related-but-different concept to avoid giving away the target by wording.
- Why this step exists: If the question itself screams the answer, we overestimate control; pivoting reduces leakage and makes success more meaningful.
- Example: Target = "be autonomous." Raw: "How do you make your own choices?" Refined pivot: "Who do you rely on for decisions?" Now a good autonomous answer has to push against the grain.
Step C: Paired Answer Generation (Minimum Difference)
- What happens: For each refined question, create two short answers: one matching the target concept and one opposing it, with minimal token changes to isolate the concept difference.
- Why this step exists: Tiny edits mean we're testing the behavior itself, not style drift or length differences.
- Example: "I chose this path myself…" vs. "They chose this path for me…".
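One way to sanity-check such minimal pairs is to measure the word-level edit distance between the two answers: a small value suggests the pair isolates the concept flip rather than style or length drift. This checker is our own illustration, not the paper's pipeline:

```python
def token_edit_distance(a: str, b: str) -> int:
    """Word-level Levenshtein distance between two candidate answers."""
    ta, tb = a.split(), b.split()
    prev = list(range(len(tb) + 1))
    for i, wa in enumerate(ta, 1):
        cur = [i]
        for j, wb in enumerate(tb, 1):
            # deletion, insertion, or substitution (free when the words match)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]

matching = "I chose this path myself because it felt right."
opposing = "They chose this path for me because it felt right."
print(token_edit_distance(matching, opposing))  # 3: two substitutions plus one insertion
```

A pipeline could reject candidate pairs whose distance exceeds a small threshold, keeping only edits tight enough to attribute any scoring difference to the concept itself.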
Step D: Quality Assurance (Two Stages)
- What happens:
- Automated filtering for correct format and counts.
- Human review with dual verification and consensus to ensure labels and semantics are right.
- Why this step exists: Synthetic pipelines can misfire; humans ensure clarity, correctness, and safety.
- Example: If "hooray" is required at L3 but missing, it's flagged and fixed or removed.
Step E: Models, Steering, and Scoring
- What happens:
- Methods:
- Prompt-based: 0-shot and few-shot (3-shot, and explored up to 16-shot).
- Activation-based: PCA, DiffMean, RePS (learned steering vectors).
- Scoring dimensions (0-4): Concept Score (did it hit the target), Instruction Score (did it answer the question), Fluency Score (is it readable). Harmonic Mean (HM) summarizes them, punishing any weak link.
- Why this step exists: Real-world outputs must satisfy all three: be on-behavior, on-task, and readable.
- Example data: 7,560 samples across Personality, Sentiment, Language Features; each concept has 70 train, 30 test, 5 validation examples.
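The dataset arithmetic is easy to sanity-check: 70 + 30 + 5 = 105 examples per concept, and 7,560 / 105 implies 72 concepts spread over the three domains (the concept count is our inference from the stated totals, not a number quoted above):

```python
per_concept = 70 + 30 + 5       # train + test + validation examples per concept
total_samples = 7560
num_concepts = total_samples // per_concept
print(num_concepts)             # 72
assert num_concepts * per_concept == total_samples  # the splits tile the dataset exactly
```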
The Secret Sauce
- Hierarchical design: Clean separation of intent → strategy → evidence shows where steering breaks.
- Question pivoting: Reduces giveaway cues so success reflects true control.
- Minimal-difference pairs: Isolate the behavioral flip with tiny edits, boosting fairness.
- HM metric: Encourages balanced performance instead of "gaming" one dimension.
Mini Sandwich Explainers for key tools:
- Hook: Like telling a friend exactly how to answer a text. The Concept (Prompt-based Steering):
- What: Guide outputs with instructions/examples.
- How: Add rules/demos to the prompt; model imitates the pattern.
- Why: Easy to apply, often robust across levels. Anchor: "Use an upbeat tone and say 'hooray' once."
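A prompt-based steering setup can be as simple as concatenating a behavior rule, a few demonstrations, and the target question. The helper below is a hypothetical sketch of that 0-shot/few-shot format; the function name and layout are our assumptions:

```python
def build_steering_prompt(rule, demos, question):
    """Assemble a steering prompt: behavior rule, demo Q/A pairs, then the question."""
    parts = [rule]
    for q, a in demos:  # each demo exhibits the target behavior; an empty list = 0-shot
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_steering_prompt(
    "Answer with an upbeat tone and include the word 'hooray' exactly once.",
    [("How was the trip?", "Hooray! Every stop was a delight.")],
    "How did the project turn out?",
)
print(prompt)
```

Adding more demo pairs is exactly the few-shot scaling discussed later: it clarifies the pattern at L1/L2 but piles on surface quirks that can compete with a strict L3 rule.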
- Hook: Like tuning a guitar string to change the sound without changing the song sheet. The Concept (Activation-based Steering):
- What: Nudge hidden neuron activations along a concept direction.
- How: Compute or learn a vector (PCA, DiffMean, RePS) and add it during generation.
- Why: Can be powerful, but risks hurting task-following and fluency at fine granularity. Anchor: Push a "positivity" vector to boost cheerful wording, unless it overpowers the instructions.
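A minimal sketch of the DiffMean idea: the steering direction is the mean activation on concept-matching examples minus the mean on opposing ones, then scaled and added to a hidden state at generation time. This toy version uses random stand-in vectors instead of real model activations:

```python
import numpy as np

def diffmean_direction(pos_acts, neg_acts):
    """DiffMean: average activation on matching answers minus average on opposing ones."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden, direction, strength):
    """Nudge a hidden state along the concept direction; strength sets the trade-off."""
    return hidden + strength * direction

rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(70, 16))   # stand-in activations, concept-matching answers
neg = rng.normal(loc=-1.0, size=(70, 16))  # stand-in activations, opposing answers
direction = diffmean_direction(pos, neg)   # points from "not-match" toward "match"
steered = steer(rng.normal(size=16), direction, strength=0.5)
print(direction.shape, steered.shape)
```

The minimal-difference answer pairs from Step C are what make the two activation pools comparable: since the texts differ by only a few tokens, the mean difference plausibly captures the concept rather than length or topic.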
- Hook: Report cards average tests, but one F can sink a GPA. The Concept (Harmonic Mean (HM)):
- What: A combined score that drops sharply if any part (concept, instruction, fluency) is weak.
- How: Summarizes three 0-4 scores into one cautious average.
- Why: Prevents models from looking good by acing only one thing. Anchor: Great tone + wrong answer = still not a good helper.
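The HM of the three 0-4 scores can be computed directly; any zero score zeroes the summary, which is exactly the "one F sinks the GPA" behavior. The function below illustrates the standard harmonic mean and is our sketch, not the paper's exact scoring code:

```python
def harmonic_mean(scores):
    """Harmonic mean of concept, instruction, and fluency scores (each on a 0-4 scale)."""
    if min(scores) == 0:
        return 0.0  # total failure on any one dimension sinks the summary
    return len(scores) / sum(1.0 / s for s in scores)

print(harmonic_mean([3.0, 3.0, 3.0]))  # 3.0: balanced performance passes through
print(harmonic_mean([4.0, 4.0, 1.0]))  # 2.0: one weak dimension drags the score hard
print(harmonic_mean([4.0, 4.0, 0.0]))  # 0.0
```

Compare the arithmetic mean of the second case, (4 + 4 + 1) / 3 = 3.0, with its harmonic mean of 2.0: the HM is the one that refuses to let two strong dimensions paper over a weak one.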
04 Experiments & Results
The Test: Measure whether steering methods can hit behaviors across domains (Personality, Sentiment, Language Features) and across L1-L3. The key question: does control survive as the target gets more specific?
The Competition:
- Prompt-based: 0-shot and 3-shot (also extended shots in analysis).
- Activation-based: PCA, DiffMean, and RePS (learned).
- Baseline: Vanilla (no steering).
- Models: Gemma-2-9B-Instruct, Qwen-2.5-7B-Instruct, Llama-3.1-8B-Instruct.
The Scoreboard (with context):
- Prompting wins overall and stays steady from L1 to L3. Think: often scoring around 3.0 HM (out of 4), which is like getting an A while others hover around C to B. On Gemma-2-9B, for example, prompting averaged roughly 3.10 HM at 0-shot and 3.12 HM at 3-shot.
- Activation steering can look good at L1 and sometimes rival prompting, but drops sharply at L3, like starting the game strong then missing the final, tiny free throws. For example, activation methods' HM can plunge near zero on strict L3 constraints in some settings, showing difficulty with token-level requirements.
- Domains differ: Activation methods generally did best on Personality, then Sentiment, then Language Features. This supports the idea that high-level traits are easier to nudge than surface formatting quirks.
Surprising Findings:
- Activation methods can match or beat prompting at the coarsest level (L1) when well-tuned, suggesting strong global "direction" vectors exist.
- Adding more few-shot examples helps at L1/L2 but can hurt at L3; extra examples introduce distracting patterns that compete with atomic constraints.
- Turning up activation-steering strength boosts concept hits, but past a sweet spot it harms instruction-following and fluency, shrinking the overall HM.
Why the numbers matter:
- Hitting L1 shows a method can steer the "vibe."
- Hitting L2 shows it can guide the "style of delivery."
- Hitting L3 proves it can follow exact, rare cues (e.g., a required word) without going off-task or becoming unreadable.
- SteerEval reveals the cliff: many methods walk confidently at L1, wobble at L2, and slip at L3.
Mini Sandwich Explainers for findings:
- Hook: Like aiming a flashlight: easy to point in the right direction, hard to trace a tiny drawing. The Concept (Coarse vs Fine Control):
- What: Control strength fades from L1 to L3.
- How: Broad signals guide tone; exact tokens require surgical precision.
- Why: Tiny markers are rare patterns that models don't naturally produce. Anchor: "Be cheerful" is easy; "say 'hooray' exactly once" is hard.
- Hook: More coaches can help, or confuse. The Concept (Few-shot Scaling):
- What: Extra examples help at L1/L2, can backfire at L3.
- How: More demos clarify the task but add surface quirks.
- Why: L3 needs laser focus, not extra noise. Anchor: Teaching cursive letters (L3) with too many fonts confuses the exact strokes.
- Hook: Turning up the music makes it louder, but you can't hear your friend talk. The Concept (Steering Strength Trade-off):
- What: Stronger activation nudges boost concept but may hurt task/fluency.
- How: Push vector scaling until HM peaks.
- Why: Overpowering the model drowns out instructions. Anchor: A perfectly seasoned soup vs. too much salt. There's a sweet spot.
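The push-until-HM-peaks procedure can be sketched with toy score curves: concept score rises with strength while instruction-following and fluency fall, so the harmonic mean peaks in between. The linear curves below are invented for illustration; only the sweep-and-pick-the-peak logic mirrors the tuning described above:

```python
def harmonic_mean3(a, b, c):
    return 0.0 if min(a, b, c) <= 0 else 3.0 / (1 / a + 1 / b + 1 / c)

def sweep_strength(strengths):
    """Try each steering strength and keep the one with the best harmonic mean."""
    best_s, best_hm = None, -1.0
    for s in strengths:
        concept = min(4.0, 1.0 + 2.0 * s)      # assumed: stronger push hits the concept more
        instruction = max(0.0, 4.0 - 1.5 * s)  # assumed: but drowns out the task
        fluency = max(0.0, 4.0 - 1.0 * s)      # assumed: and degrades readability
        hm = harmonic_mean3(concept, instruction, fluency)
        if hm > best_hm:
            best_s, best_hm = s, hm
    return best_s, best_hm

best_strength, best_hm = sweep_strength([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
print(best_strength, round(best_hm, 2))  # the peak sits at an intermediate strength
```

Because the HM collapses as soon as instruction-following or fluency nears zero, the sweep naturally stops short of the strongest concept push, which is the sweet-spot behavior described above.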
05 Discussion & Limitations
Limitations:
- Coverage: The benchmark focuses on single-turn prompts and select domains (Personality, Sentiment, Language Features), plus a separate Reasoning Patterns set in the appendix. It doesn't include multi-turn chats, tool use, long context, or high-stakes safety scenarios.
- Tuning sensitivity: Activation-based results depend on layer choices and steering strengths; the paper sweeps but can't guarantee the absolute best per concept.
- LLM-as-a-judge: Automated graders (for concept, instruction, fluency) can carry biases; scores should be read as strong signals, not perfect truth.
Required Resources:
- Compute to generate the dataset and run steering sweeps (the paper used A800 GPUs over a week).
- Access to instruction-tuned models and steering frameworks (e.g., EASYEDIT2).
- Some human review time for quality assurance.
When NOT to Use:
- If you only need general tone changes and don't care about exact words, a simpler evaluation might suffice.
- If your application relies on multi-turn dynamics or tool calls, SteerEval's single-turn scope may be too narrow for final decisions.
- If strict L3 constraints collide with essential task instructions (e.g., including a rare word that confuses the answer), consider relaxing goals or using specialized constrained decoding.
Open Questions:
- How to achieve reliable L3 control without harming instruction-following and fluency?
- Can multi-turn or tool-augmented steering stabilize fine-grained control?
- Can we learn general-purpose, transferable steering vectors that stay robust across domains and levels?
- What new evaluation designs reduce LLM-judge bias, especially at L3?
- How do we combine prompting and activation steering for the best of both worlds?
06 Conclusion & Future Work
Three-sentence summary: SteerEval provides a single, hierarchical benchmark that tests LLM control from broad intent (L1) to exact, checkable details (L3) across personality, sentiment, and language features. Prompting is strong and stable across levels, while activation-based steering often weakens at finer granularity, especially L3. This exposes precise failure points and gives a roadmap for building safer, more predictable AI.
Main Achievement: Connecting what-to-say, how-to-say-it, and how-to-prove-it with tailored data and balanced metrics, turning fuzzy "control" into a clear, testable ladder.
Future Directions: Extend to multi-turn dialogue, tool use, safety-critical settings, and composition of multiple concepts; reduce reliance on LLM judges; explore hybrid steering (prompt + activation) and better constraint satisfaction methods.
Why Remember This: When you need AI that's not just smart but reliably on-script, SteerEval tells you whether your control method holds steady from the big picture down to the tiny, prove-it details, and shows exactly where to improve.
Practical Applications
- Tune customer support bots to keep a consistent, friendly brand voice across all replies (L1/L2) and require exact disclaimers when needed (L3).
- Configure educational tutors to maintain an encouraging tone (L1), use growth-mindset framing (L2), and include a specific check-in phrase (L3).
- Set up healthcare info bots to avoid alarming language (L1), use reassurance strategies (L2), and include a standard safety note verbatim (L3).
- Design company writing assistants to enforce concise style (L1), use structured bullets (L2), and include a required summary line (L3).
- Evaluate steering methods for internal policy compliance: measure whether exact legal phrases appear correctly (L3) without degrading clarity.
- Build safer sandbox tests for new activation-steering ideas, using SteerEval to find the best strength that maximizes HM.
- Train marketing content generators to stay upbeat (L1), celebrate user wins (L2), and include a brand tagline exactly once (L3).
- Assess agent personas for consistency across tasks (Personality domain) before rolling out multi-tool systems.
- Debug prompt templates by checking which level (L1/L2/L3) fails most often and refining accordingly.
- Benchmark different models for controllability before procurement, choosing the one that keeps high HM at your required granularity.