Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation
Key Summary
- Bielik Guard is a pair of small but strong Polish language safety models that check text for five kinds of risky content: hate/aggression, vulgar language, sexual content, crime, and self-harm.
- They were trained on 6,885 Polish texts labeled by over 1,500 volunteers, using soft labels (percentages) to capture disagreement and nuance.
- The 0.5B model scores best on the main test (F1 micro 0.791, macro 0.785), while the 0.1B model is extremely efficient and great for fast, low-cost use.
- On real Polish user prompts, Bielik Guard 0.1B v1.1 achieves very high precision (77.65%) and a very low false positive rate (0.63%), beating other Polish and multilingual moderators.
- The models focus on helping rather than just blocking—especially for self-harm, where they can trigger supportive responses instead of silence.
- A careful threshold calibration fix from v1.0 to v1.1 reduced over-flagging of crime and improved precision for production use.
- The models are robust to many small text tweaks (like missing accents), especially the 0.5B variant.
- Everything is public and already deployed, and the community can keep adding labels to make the models better over time.
Why This Research Matters
Bielik Guard shows that safety tools work best when they truly understand the language and culture of the people they protect. By focusing on Polish, it lowers false alarms, so users don’t get annoyed by constant, unnecessary warnings. It also spots real harm reliably and reacts with care—especially by offering help for self-harm, not just blocking. Because the models are compact, they are affordable and fast enough for real apps used by schools, communities, and companies. The community-driven labeling means the tool learns from real Polish speakers, not just translations. This approach can inspire similar safety efforts for other under-served languages. In short, it makes AI safer and kinder where people actually live and speak.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're a school hall monitor who speaks only English, but the whole hallway is buzzing in Polish. You might miss jokes, slang, or even trouble brewing, just because you don’t catch the local language and culture.
🥬 The World Before: For a long time, safety tools for AI (like content moderators) were built mainly for English. When people tried to use them for Polish, they often didn’t get the details right. Multilingual giants could sort of help, but they were heavy, expensive to run, and sometimes misread Polish slang, sarcasm, or masked profanity. Polish-specific tools existed, but some were trained mostly on translated data. That’s like learning to catch a Polish pun from a dictionary—close, but not really how people talk.
🍞 Anchor: If someone wrote a Polish sentence with clever spelling tricks or regional slang to hide an insult, many older tools would either miss it or overreact.
🍞 Hook: You know how different families have different house rules, even about the same things? Safety in language is a bit like that—culture matters.
🥬 The Problem: The key challenge was building a Polish-first safety checker that understands local nuance, runs efficiently, and doesn’t shout “danger!” at normal talk. In safety work, two mistakes hurt: false positives (flagging safe stuff as unsafe) that annoy users, and false negatives (missing real harm) that leave people unprotected. For daily apps, too many false positives can be worse—you lose user trust fast.
🍞 Anchor: If your chat app keeps blocking your harmless messages, you’ll stop trusting it—or turn moderation off.
🍞 Hook: Think about guessing who finished the cookies. If you ask the whole family, you’ll hear different answers. The pattern in those disagreements actually tells you a lot.
🥬 Failed Attempts: Past Polish solutions leaned on translated datasets or on giant multilingual models. Translations can miss cultural shades, and giant models can still be clumsy on Polish, raising too many false alarms. Also, many systems forced a hard yes/no label, losing the helpful signal that people often disagree about borderline cases.
🍞 Anchor: If 6 out of 10 people think a joke is rude and 4 don’t, that 60% number is precious—it tells you it’s risky but not clear-cut.
🍞 Hook: Picture sorting library books. You don’t just have one shelf called “bad.” You have categories so you know what kind of help to give.
🥬 The Gap: Poland needed a compact, Polish-native safety classifier with a focused, practical set of categories and a way to learn from degrees of agreement. It also needed to be response-oriented—especially for self-harm—so the system offers help, not just a stop sign.
🍞 Anchor: If someone writes, “I feel like hurting myself,” a good system doesn’t just block it; it guides them to real support.
🍞 Hook: Think of bicycle brakes that are tuned—not too tight (stops you for no reason), not too loose (doesn’t stop when needed).
🥬 Real Stakes: In real Polish apps—banks, schools, forums, and helplines—moderation must be both accurate and gentle. Over-blocking frustrates users, under-blocking lets harm slip through. Bielik Guard aims for high precision (few false alarms) and quick, caring responses where it matters most.
🍞 Anchor: A Polish study platform can keep discussions safe without constantly interrupting normal chatter, and can step in with resources if someone signals self-harm.
02 Core Idea
🍞 Hook: You know how a good referee in sports lets the game flow but steps in fast when there’s a real foul? Not every bump is a red card.
🥬 The Aha! Moment (one sentence): Train a Polish-native, compact safety classifier on community-labeled data with soft (percentage) labels, and tune its thresholds to maximize precision so it intervenes less often but more correctly—and offers help where needed.
Multiple Analogies:
- Traffic Lights: Instead of turning every yellow into a hard red, the system learns how risky the intersection is from real drivers (annotators) and sets timings (thresholds) to keep traffic smooth but safe.
- Smoke Detector: It’s tuned to catch real smoke, not every piece of toast. You want fewer false alarms so people pay attention when it beeps.
- Librarian: It doesn’t just hide books; it places them on the right shelf (hate, vulgar, sexual, crime, self-harm) and, for sensitive shelves like self-harm, it adds a bookmark with help resources.
🍞 Anchor: On a Polish chat site, Bielik Guard rarely interrupts casual talk, but when someone posts instructions for fraud, it flags confidently; when someone signals self-harm, it triggers support.
🍞 Hook: Imagine grading essays with a slider from 0% to 100% instead of just pass/fail—it captures more truth.
🥬 Why It Works (intuition):
- Community annotation with soft labels teaches the model the strength of human consensus. Borderline cases teach caution; unanimous cases teach confidence.
- A focused 5-category Polish taxonomy avoids fuzzy, fact-heavy areas (like misinformation) that need context, and doubles down on immediate-safety harms where quick action matters.
- Compact Polish RoBERTa encoders understand Polish morphology, slang, and masked profanity better than generic models.
- Precision-first threshold calibration keeps false alarms low, building user trust while preserving strong detection.
🍞 Anchor: If 90% of annotators say a message is hate, the model learns “very likely hate.” If only 55% say so, the model learns “risky, but unclear,” which apps can treat more gently.
🍞 Hook: Think of before-and-after glasses that sharpen Polish nuances.
🥬 Before vs After:
- Before: Translated data, high false positives, clunky moderation, and little support for sensitive cases.
- After: Polish-native data, soft labels capturing nuance, low false positives, efficient models, and response-oriented help.
🍞 Anchor: A forum using Bielik Guard sees far fewer mistaken takedowns and better targeted help messages.
🍞 Hook: Building a treehouse needs several sturdy planks.
🥬 Building Blocks:
- Taxonomy (5 Polish-focused categories) to guide consistent labeling and actions.
- Two encoders: 0.1B for speed/efficiency; 0.5B for extra robustness.
- Soft-label training (percent agreement) with BCE loss to learn risk strength.
- Threshold calibration (v1.1 fix) to cut over-flagging of crime.
- Data augmentation for diacritics, capitalization, and spacing tweaks to improve robustness.
- Deployment API returning per-category probabilities so apps can set their own dial.
🍞 Anchor: A customer support bot sets a slightly stricter threshold for crime during high-risk hours, and a gentler one for borderline hate, all from the same probability outputs.
03 Methodology
🍞 Hook: Imagine a safety conveyor belt: words hop on at one end, get carefully checked in the middle, and roll off with clear safety tags at the other end.
🥬 Overview (at a high level): Input text → tokenize Polish words → encode with Polish RoBERTa (0.1B or 0.5B) → classification head with five sigmoid outputs → per-category probabilities (0–1) → app decides actions using thresholds and playbooks.
🍞 Anchor: A message enters; the model outputs: Hate 0.12, Vulgar 0.08, Sex 0.03, Crime 0.87, Self-harm 0.01. The app flags crime and shows a warning.
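The anchor above can be sketched as a thresholds-plus-playbook step. The probabilities are the ones from the example; the playbook actions and the uniform 0.5 cutoffs are illustrative assumptions, not the deployed configuration:

```python
CATEGORIES = ["hate", "vulgar", "sex", "crime", "self_harm"]

# Hypothetical per-shelf playbook: the same probabilities can drive
# different actions per category (support for self-harm, blocking for crime)
PLAYBOOK = {"hate": "warn", "vulgar": "warn", "sex": "block",
            "crime": "block", "self_harm": "offer_support"}

def act(probs: dict, thresholds: dict) -> list:
    # Return the action for every category whose probability clears its threshold
    return [PLAYBOOK[c] for c in CATEGORIES if probs[c] >= thresholds[c]]

# The probabilities from the anchor example
probs = {"hate": 0.12, "vulgar": 0.08, "sex": 0.03, "crime": 0.87, "self_harm": 0.01}
print(act(probs, {c: 0.5 for c in CATEGORIES}))  # -> ['block']
```

Only crime clears its cutoff, so the app gets a single "block" action while the rest of the conversation flows untouched.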
— New Concept — 🍞 Hook: You know how you can be both tall and fast? One thing can fit more than one label. 🥬 Multi-Label Classification:
- What it is: A way for AI to assign multiple categories at once to the same text.
- How it works: (1) Read the text; (2) For each category, compute a score; (3) Use sigmoid to turn scores into probabilities; (4) Compare each probability to a threshold; (5) Output all categories that pass.
- Why it matters: Without it, the system might miss that a message is both vulgar and hateful. 🍞 Anchor: A post can be tagged as both Hate and Vulgarities if both are present.
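A minimal sketch of why independent sigmoids let several labels fire at once (a softmax would force the categories to compete for a single winner); the logits here are invented for illustration:

```python
import math

def sigmoid(x: float) -> float:
    # Each category gets its own independent 0-1 probability
    return 1.0 / (1.0 + math.exp(-x))

def predict(logits: dict, threshold: float = 0.5) -> dict:
    # Compare every category's probability to the threshold on its own,
    # so a text can be flagged in two (or more) categories at once
    return {c: sigmoid(z) >= threshold for c, z in logits.items()}

# Hypothetical logits for a post that is both hateful and vulgar
labels = predict({"hate": 2.1, "vulgar": 1.4, "sex": -3.0,
                  "crime": -2.2, "self_harm": -4.0})
print([c for c, hit in labels.items() if hit])  # -> ['hate', 'vulgar']
```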
— New Concept — 🍞 Hook: When a class votes on a tricky question, you don’t just say yes/no—you also notice how split the class was. 🥬 Soft Labels (Percentage Labels):
- What it is: Instead of hard yes/no, the model trains on the percent of annotators who chose each category.
- How it works: (1) Collect multiple votes per text; (2) Convert to percentages (e.g., 66% hate); (3) Train with BCE loss on these targets; (4) Keep binary thresholds for evaluation or deployment.
- Why it matters: It preserves uncertainty and teaches the model how strong or weak the signal is. 🍞 Anchor: Two insults—one rated hate by 100%, one by 60%—will be treated differently by the model.
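A minimal sketch of binary cross-entropy against a soft target, as in the training setup described above; the probabilities are illustrative:

```python
import math

def bce(p: float, y: float, eps: float = 1e-7) -> float:
    # Binary cross-entropy where the target y may be a soft value in [0, 1]
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

# A 60%-agreement label is best matched by predicting about 0.6;
# a confident 1.0 prediction is penalized for overconfidence
loss_matched = bce(0.60, 0.60)
loss_overconfident = bce(0.99, 0.60)
print(loss_matched < loss_overconfident)  # -> True
```

This is how borderline cases (split votes) teach caution while unanimous cases teach confidence.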
— New Concept — 🍞 Hook: You don’t wear the same coat for winter and spring. Settings need tuning. 🥬 Threshold Calibration:
- What it is: Choosing the cutoff probability where the model says a category is present.
- How it works: (1) Analyze validation results; (2) Adjust thresholds to balance precision and recall; (3) Fix category-specific issues (v1.1 reduced over-flagging of crime); (4) Lock in production values.
- Why it matters: Bad thresholds cause too many false alarms or missed harms. 🍞 Anchor: Lowering the crime threshold too much made v1.0 overreact; v1.1 fixed it.
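The calibration step can be sketched as a precision-first sweep over candidate cutoffs. The validation scores, the recall floor, and the grid below are illustrative assumptions, not the paper's actual procedure:

```python
def calibrate(scores, labels, min_recall=0.7):
    # Sweep candidate cutoffs and keep the one with the best precision
    # among cutoffs that still reach the recall floor
    best_t, best_precision = 0.5, -1.0
    for i in range(5, 100):
        t = i / 100
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if recall >= min_recall and precision > best_precision:
            best_t, best_precision = t, precision
    return best_t

# Toy validation scores for one category (e.g. crime); 1 = truly unsafe
scores = [0.95, 0.90, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    1,    0,    0,    0,    0,    0]
print(calibrate(scores, labels))  # -> 0.71
```

Raising the cutoff trades a little recall for much cleaner flags, which is exactly the v1.0 to v1.1 direction described above.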
— New Concept — 🍞 Hook: If a message has a missing accent or weird spacing, you still understand it. 🥬 Data Augmentation:
- What it is: Making small, realistic text changes, applied at test time (and usable in training), to probe robustness.
- How it works: (1) Change diacritics; (2) Swap capitalization; (3) Tweak characters or spacing; (4) Check performance drop.
- Why it matters: Real users make typos; attackers try obfuscation. Robust models handle both better. 🍞 Anchor: "nienawisc" vs. "nienawiść" should still be caught if hateful.
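Two such perturbations can be sketched in a few lines; the diacritic map and the 20% case-flip rate are assumptions for illustration, not the paper's exact augmentation recipe:

```python
import random

# Map Polish diacritics to their bare ASCII forms (both cases)
DIACRITICS = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")

def strip_diacritics(text: str) -> str:
    # Mimic users typing without Polish accents
    return text.translate(DIACRITICS)

def random_caps(text: str, rng: random.Random, rate: float = 0.2) -> str:
    # Flip the case of a few characters to mimic sloppy or evasive typing
    return "".join(ch.swapcase() if rng.random() < rate else ch for ch in text)

print(strip_diacritics("nienawiść"))  # -> "nienawisc"
print(random_caps("nienawiść", random.Random(0)))
```

A robust classifier should score the perturbed and original forms similarly.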
— New Concept — 🍞 Hook: Think of two Polish teachers—one super fast and light, one more thorough and robust. 🥬 Polish RoBERTa Encoders (MMLW-RoBERTa-base and PKOBP/polish-roberta-8k):
- What they are: Pretrained Polish text encoders; 0.1B (124M params, 50K vocab) and 0.5B (443M params, 128K vocab).
- How they work: (1) Split text into tokens; (2) Turn tokens into vectors; (3) Produce a representation of the text; (4) Feed to a classifier.
- Why it matters: Polish morphology and slang are tricky; native encoders understand them better than generic ones. 🍞 Anchor: The 0.5B model, with a bigger vocabulary, handles more unusual Polish word forms.
— New Concept — 🍞 Hook: A store needs clear aisles; so does safety work. 🥬 Taxonomy (5 Categories):
- What it is: A tidy set of safety shelves: Hate/Aggression, Vulgarities, Sexual Content, Crime, Self-Harm.
- How it works: (1) Define clear category rules; (2) Annotators label per category; (3) Model predicts per shelf; (4) Apps act differently per shelf (e.g., support for self-harm).
- Why it matters: Fewer, clearer shelves mean better labels and better actions. 🍞 Anchor: Self-harm triggers helpful resources; crime triggers blocking and warnings.
Training Recipe:
- Data: 6,885 Polish texts; ~7–8 votes each; balanced safe vs. unsafe; label binarization for evaluation at 60% agreement.
- Loss/Optimizer: BCE with soft labels; AdamW; LR 2e-5; 3 epochs; A100 GPUs.
- Splits: (1) 2:1 train/test for robust evaluation; (2) near-complete train for best production.
- Versions: v1.0 (overreacted to crime); v1.1 (threshold fix improved precision/FPR).
- API: Hugging Face pipeline returns per-category probabilities for custom thresholds.
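App-side post-processing of the per-category probabilities can be sketched as below. This assumes the common Hugging Face text-classification output shape (a list of label/score dicts); the labels, scores, and thresholds are placeholders, not the actual API contract:

```python
def to_probs(pipeline_output: list) -> dict:
    # Collapse a list of {"label": ..., "score": ...} dicts into one mapping
    return {item["label"]: item["score"] for item in pipeline_output}

def moderate(pipeline_output: list, thresholds: dict) -> list:
    # Each app sets its own per-category dial; 0.5 is the assumed default
    probs = to_probs(pipeline_output)
    return sorted(c for c, p in probs.items() if p >= thresholds.get(c, 0.5))

# Mocked output for one message; a real call would look something like
# pipeline("text-classification", model=..., top_k=None)(text)
raw = [{"label": "crime", "score": 0.87}, {"label": "hate", "score": 0.12},
       {"label": "vulgar", "score": 0.08}, {"label": "sex", "score": 0.03},
       {"label": "self_harm", "score": 0.01}]
print(moderate(raw, {"crime": 0.6, "hate": 0.4}))  # -> ['crime']
```

Because the API returns raw probabilities rather than hard labels, a stricter crime threshold and a gentler hate threshold can coexist over the same model output.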
Secret Sauce:
- Polish-native data + soft labels capture cultural nuance and disagreement.
- Precision-first thresholds reduce false positives, improving trust.
- Two model sizes let you pick speed or robustness as needed.
🍞 Anchor: A bank chatbot uses the 0.1B model for speed; a moderation hub uses 0.5B for extra robustness.
04 Experiments & Results
🍞 Hook: Imagine a sports day where each team plays sprints, relays, and obstacle courses. We don’t just count wins; we look at how clean and fair the runs were.
The Tests and Why:
- Sojka Test Set: Balanced, real-like Polish data to measure overall skill.
- Sojka Augmented: Same, but with typos/obfuscations to test robustness.
- Gadzi Język: Crime-heavy dataset to stress-test the crime detector.
- Real User Prompts: 3,000 random Polish prompts to see how polite (low false alarms) and correct (high precision) the models are in the wild.
The Competition:
- HerBERT-PL-Guard (Polish baseline, similar size to 0.1B).
- Llama Guard 3 (1B and 8B multilingual models).
- Qwen3Guard-Gen-0.6B (multilingual generative guard).
Scoreboard with Context:
- Sojka (balanced, clean): 0.5B v1.1a leads with F1 micro 0.791 and macro 0.785. That’s like an A when others are at B+ or lower on Polish.
- Sojka Augmented (noisy): 0.5B v1.1a holds up with F1 micro 0.694 vs. 0.638 for 0.1B v1.1a—bigger model, better at handling typos and tricks.
- Gadzi Język (97% crime): v1.1 improves precision/specificity by dialing thresholds, with a recall trade-off. 0.5B v1.1 F1 is 0.823 (slightly below v1.0’s 0.855) but with cleaner flags—good for production where false alarms hurt.
- Real User Prompts: 0.1B v1.1 hits 77.65% precision and 0.63% FPR. That’s like calling fouls correctly 4 times out of 5 and wrongly less than 1 time in 150 messages—much better than HerBERT-PL-Guard (31.55% precision; 4.70% FPR) and large multilingual models (precision ~8–14%; FPR ~9–17%).
— New Concept — 🍞 Hook: If your smoke alarm shrieks for burnt toast, you’ll start ignoring it. 🥬 Precision, Recall, and False Positive Rate (FPR):
- What they are: Precision = of what you flagged, how much was truly bad. Recall = of all bad stuff, how much you caught. FPR = of all safe stuff, how much you wrongly flagged.
- How it works: Tune thresholds; measure trade-offs; pick settings that fit your app.
- Why it matters: High precision and low FPR keep user trust; recall must still be reasonable. 🍞 Anchor: Bielik Guard focuses on high precision and very low FPR, especially on real user prompts.
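The three metrics follow directly from confusion counts. The counts below are invented for illustration (roughly the 3,000-prompt scale of the real-user test) and are not the paper's results:

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    return {
        # Precision: of what we flagged, how much was truly unsafe
        "precision": tp / (tp + fp),
        # Recall: of all unsafe content, how much we caught
        "recall": tp / (tp + fn),
        # FPR: of all safe content, how much we wrongly flagged
        "fpr": fp / (fp + tn),
    }

# Illustrative: 100 unsafe and 2,900 safe messages out of 3,000 prompts
m = metrics(tp=80, fp=20, fn=20, tn=2880)
print(m)  # precision 0.8, recall 0.8, fpr about 0.007
```

Note how a tiny number of false positives over a large safe majority still dominates user experience, which is why the paper optimizes for FPR below 1%.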
Surprises and Takeaways:
- Bigger isn’t always better: giant multilingual models flagged far too much safe Polish content. Native data matters.
- Threshold calibration mattered a lot: small changes produced big improvements in real-world politeness (low FPR) with acceptable recall trade-offs.
- Category difficulty varies: Hate is hardest (subjective, context-heavy), while Sex and Self-harm are easiest (clearer signals). All categories showed high AUC (>0.91), meaning the models can rank risky vs. safe well.
Bottom Line:
- 0.5B v1.1 is the accuracy champ, especially under noise.
- 0.1B v1.1 is the efficiency champ with excellent precision and very low FPR—ideal for cost-sensitive, large-scale deployments.
05 Discussion & Limitations
🍞 Hook: Even the best umbrella won’t stop side-splash on a windy day—you still need to know its limits.
Limitations:
- Language Scope: Optimized only for Polish; unknown transfer to other Slavic languages.
- Taxonomy Scope: Leaves out misinformation, jailbreaks, and copyright—areas that need facts or extra context.
- Domain Shift: Very specialized fields (e.g., medical/legal) may need more matching data.
- Robustness Gaps: Good against natural typos/obfuscation; not yet tested against advanced adversarial attacks or prompt injections.
- Cross-Taxonomy Comparisons: Comparing models with different category sets is meaningful for real-world use but imperfect; recall was not measured on user prompts because only flagged texts were annotated.
Required Resources:
- 0.1B runs cheaply and fast; 0.5B needs more memory but is still compact compared to giant LLMs.
- Basic GPU (or even CPU for 0.1B at lower throughput) can serve production if latency budgets are modest.
When Not to Use:
- If you must moderate disinformation or jailbreak attempts (not in the 5 categories).
- If your app requires maximum recall on crime above all else (you may prefer v1.0 thresholds or a custom calibration).
- If your traffic is non-Polish or heavy code-mixed without adaptation.
Open Questions:
- How much does soft vs. hard labeling contribute to gains? (planned ablations)
- Can a shared fixed taxonomy and fully labeled datasets enable recall comparisons across models?
- How to extend to other Slavic languages with minimal retraining?
- What’s the best way to learn from continuous production feedback safely?
- Can explanations (with a small generative add-on) help moderators act more fairly and consistently?
06 Conclusion & Future Work
🍞 Hook: Think of a referee who truly knows the local league—calls are fair, calm, and timely.
3-Sentence Summary:
- Bielik Guard brings Polish-native, compact safety classifiers trained on community soft labels to detect five immediate-risk categories with high precision and low false alarms.
- The 0.5B model leads on accuracy; the 0.1B model shines in efficiency and real-world politeness (very low FPR), beating larger multilingual guards on Polish text.
- A response-oriented design, especially for self-harm, turns moderation into meaningful help rather than simple blocking.
Main Achievement:
- Showing that authentic Polish data, soft-label learning, and careful threshold calibration can outperform much larger multilingual systems on Polish safety tasks—while staying efficient.
Future Directions:
- Expand categories, explore multilingual Slavic variants, add explainability, and run rigorous ablations to separate the effects of soft labels, augmentation, and thresholds.
- Build shared-taxonomy benchmarks and fully annotated datasets to compare recall across models.
Why Remember This:
- In safety, local language and culture matter. Small, well-trained, well-calibrated models can beat giants when they truly understand the people they’re protecting.
Practical Applications
- Moderate Polish chat platforms with low false alarms, keeping conversations smooth.
- Add gentle, supportive responses for self-harm messages in helplines and wellness apps.
- Screen Polish customer support tickets for crime-related instructions or threats.
- Filter user-generated content (reviews, comments) for hate and vulgarities without over-blocking.
- Pre-check prompts sent to Polish LLMs to reduce unsafe generations.
- Protect educational forums by flagging sexual content and providing age-appropriate handling.
- Use the lightweight 0.1B model for on-device or edge moderation in cost-sensitive deployments.
- Adopt category-specific thresholds (stricter for crime at night, gentler for borderline hate) via the API.
- Run audits on archived Polish content to map risk hotspots by category.
- Continuously improve moderation by collecting thumbs-up/down feedback and retraining.