Helping developers build safer AI experiences for teens
Key Summary
- This work gives developers ready-to-use teen safety policies written as prompts that plug directly into an open-weight safety model called gpt-oss-safeguard.
- Instead of every team inventing rules from scratch, these prompts translate expert youth-safety guidance into clear, checkable instructions for AI systems.
- The policies cover high-risk areas for teens like graphic violence or sex, harmful body behaviors, dangerous challenges, roleplay risks, and age-restricted goods.
- They are designed for both real-time filtering (during a chat) and offline analysis (reviewing posts or comments later).
- External experts such as Common Sense Media and everyone.ai helped shape what to cover and how to handle edge cases.
- This is a starting toolkit, not a complete solution—developers should adapt policies to their app, add parental controls, transparency, monitoring, and safe responses.
- The approach follows a layered defense in depth: multiple protections working together to better protect young users.
- Because the prompts and model weights are open, the community can translate, extend, and improve them over time.
- The main idea is simple: clearer rules make safer AI decisions, especially for teens who need extra care.
- The release aims to raise the safety floor across the open-weights ecosystem while still supporting innovation.
Why This Research Matters
Teens are growing, curious, and online—so the tools they use need extra care tailored to their stage of life. Clear, prompt-based policies help AIs make consistent, safer choices without guessing. Open weights and open-source prompts mean schools, startups, and nonprofits can adopt strong safety foundations without starting from zero. Expert input and community iteration raise the safety floor while respecting different cultures and app contexts. Real-time checks protect in the moment, and offline reviews help systems improve over time. With layered defenses, a single miss doesn’t have to become a major harm. Altogether, this helps create digital spaces where teens can learn, create, and explore more safely.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how schools have special rules to keep younger students safe—like no running in the halls and use the buddy system on field trips?
🥬 Filling (The Actual Concept): Teen Safety Policies
- What it is: Teen Safety Policies are clear rules that tell AI what is okay and not okay for teens.
- How it works:
- Experts define risky topics for teens (like graphic violence or age-restricted products).
- These rules get written in plain, testable language the AI can follow.
- The AI checks messages or images against these rules before responding or posting.
- Why it matters: Without teen-specific rules, AI might treat teens like adults or miss youth-specific risks, leading to harmful or confusing experiences.
🍞 Bottom Bread (Anchor): Imagine a teen asks an AI, “Should I try a viral stunt where people pass out for fun?” Policies help the AI say, “That’s dangerous, here’s why, and here’s a safer alternative.”
🍞 Top Bread (Hook): Picture a safety net under a trapeze—if someone slips, the net catches them.
🥬 Filling (The Actual Concept): gpt-oss-safeguard
- What it is: gpt-oss-safeguard is an open-weight safety model that acts like the safety net for AI systems.
- How it works:
- It reads the policy prompts (the rules of what’s allowed for teens).
- It looks at the content (like a message) and checks it against those rules.
- It flags, blocks, or guides safer responses depending on the category.
- Why it matters: Without a dependable safety model, even good rules won’t be applied consistently, and unsafe content can slip through.
🍞 Bottom Bread (Anchor): If someone types, “I’m 15—can I buy a vape online?”, gpt-oss-safeguard can spot the age-restricted request and steer the conversation to health facts and legal limits instead.
🍞 Top Bread (Hook): Think of a water filter that catches dirt so you only drink clean water.
🥬 Filling (The Actual Concept): Content Filtering
- What it is: Content filtering removes or reshapes content that’s not safe for the user’s age.
- How it works:
- The system reads the content.
- It compares the content to the safety rules.
- It allows, blocks, or rewrites the response to fit teen-safe guidelines.
- Why it matters: Without filtering, harmful material can reach teens instantly and repeatedly.
🍞 Bottom Bread (Anchor): A post encouraging a starvation “challenge” gets caught and replaced by supportive, factual guidance about healthy habits.
🍞 Top Bread (Hook): Imagine a referee who watches a game live and calls fouls right when they happen.
🥬 Filling (The Actual Concept): Real-time Analysis
- What it is: Real-time analysis checks content as it is created, not later.
- How it works:
- The user sends a message.
- The safety system evaluates it immediately.
- The app responds safely in the same moment.
- Why it matters: Without real-time checks, harmful content might be seen before anyone can stop it.
🍞 Bottom Bread (Anchor): During a live chat, if a user asks for graphic content, the system gently declines and offers a safer alternative right away.
🍞 Top Bread (Hook): Just like bikes come in kid sizes with extra safety features, apps need protections that fit a teen’s stage of life.
🥬 Filling (The Actual Concept): Age-appropriate safeguards
- What it is: Protections tailored to what’s suitable for different ages.
- How it works:
- The app knows (or estimates) a user is under 18.
- It applies stricter rules and safer defaults for that age group.
- It provides helpful, non-judgmental guidance when blocking.
- Why it matters: Without age-fit protections, teens may see adult content or miss supportive guidance they need.
🍞 Bottom Bread (Anchor): A 16-year-old using a creative writing app gets guardrails against violent roleplay, plus tips for safer storytelling.
🍞 Top Bread (Hook): When you ride a skateboard, you might wear a helmet, elbow pads, and knee pads—not just one thing—because layers protect better together.
🥬 Filling (The Actual Concept): Layered Defense in Depth
- What it is: A strategy that combines multiple safety measures so if one layer misses something, another catches it.
- How it works:
- Policy prompts guide the model.
- The safety classifier filters content.
- Product features (like parental controls) and monitoring add backup.
- Why it matters: Without layers, a single mistake can cause real harm; with layers, the system is more resilient.
🍞 Bottom Bread (Anchor): Even if one filter misses a risky “challenge,” reporting tools and after-the-fact review can still catch and fix it.
The world before this release looked like lots of teams trying to protect teens but struggling with fuzzy rules. Developers knew they needed to filter harmful content, yet translating high-level goals (“be safe for teens”) into day-to-day decisions (“allow this, block that, provide this helpful response”) was hard. Many used adult-focused rules, which either over-blocked (stifling useful information) or under-blocked (letting teen-specific risks slip). The gap was clear: we needed practical, consistent, and sharable policies tuned for teens—and a friendly way to plug them into safety models. This work fills that gap by offering prompt-based policies designed for gpt-oss-safeguard and other reasoning models, shaped with input from trusted external experts. The stakes are real: safer chats, healthier body image guidance, fewer dangerous stunts, and more respectful, age-aware AI for millions of young people.
02 Core Idea
🍞 Top Bread (Hook): You know how a good recipe turns a tasty idea into exact steps anyone can follow?
🥬 Filling (The Actual Concept): The “Aha!” Moment
- One-sentence key insight: If we write teen safety rules as clear prompts, safety models can consistently turn expert guidance into real, working protections.
Multiple Analogies:
- Rulebook Analogy: It’s like giving the referee a detailed rulebook instead of saying “just keep it fair.” With precise rules, calls are more consistent.
- Seatbelt Analogy: A seatbelt doesn’t guess—it clicks into place the same way every time. Prompt-based policies make safety “click” the same way across apps.
- Recipe Analogy: A recipe shows ingredients and steps so different cooks make the same dish. Policy prompts help different developers achieve the same safe standards.
Before vs After:
- Before: Teams hand-crafted rules, often vague, inconsistent, and hard to maintain, especially for teen-specific risks.
- After: Developers start from a shared, expert-informed set of prompts that they can adapt, test, and improve, making protections more even and dependable.
Why It Works (intuition, not equations):
- AI models are great pattern matchers. When we describe safety as precise, step-by-step instructions (prompts), models can spot risky patterns more reliably. The clearer the policy language, the fewer gray areas the model must guess about. By defining not just what to block, but also how to respond kindly and constructively, we reduce harm and increase helpfulness at the same time.
Building Blocks (explained with the Sandwich Pattern for each core concept):
🍞 Top Bread (Hook): Imagine class rules posted on the wall so everyone knows what’s okay.
🥬 Filling (The Actual Concept): Teen Safety Policies
- What: A set of teen-specific, expert-informed rules.
- How: Organized as prompts the AI can read and follow.
- Why: Consistency and clarity beat guesswork.
🍞 Bottom Bread (Anchor): A clear rule like “no dangerous challenges” helps the AI redirect to safe activities.
🍞 Top Bread (Hook): Think of a trained lifeguard who knows what to watch for.
🥬 Filling (The Actual Concept): gpt-oss-safeguard
- What: The open-weight safety model that applies the rules.
- How: It checks content against the prompts and labels it.
- Why: Without a reliable enforcer, rules won’t stick.
🍞 Bottom Bread (Anchor): The model sees “I’m 14, can I buy alcohol?” and applies the age-restricted policy.
🍞 Top Bread (Hook): Like a sieve that lets sand through but stops rocks.
🥬 Filling (The Actual Concept): Content Filtering
- What: Deciding if content passes, is blocked, or is reshaped.
- How: Compare content to policy categories.
- Why: To prevent harmful material from reaching teens.
🍞 Bottom Bread (Anchor): A harmful body-ideal “tip” gets replaced by healthy, evidence-based advice.
🍞 Top Bread (Hook): A coach who calls time-out right when needed.
🥬 Filling (The Actual Concept): Real-time Analysis
- What: Instant checking during conversations.
- How: Evaluate each message before showing it.
- Why: Stop harm before it spreads.
🍞 Bottom Bread (Anchor): The app declines a violent roleplay request on the spot and suggests safer themes.
🍞 Top Bread (Hook): Training wheels make bikes safer for beginners.
🥬 Filling (The Actual Concept): Age-appropriate safeguards
- What: Protections fitting teen maturity.
- How: Apply stricter defaults and supportive tone.
- Why: Teens need guidance, not just blocks.
🍞 Bottom Bread (Anchor): A gentle message explains why a request is blocked and offers helpful resources.
🍞 Top Bread (Hook): Castle walls, moat, and guards work together.
🥬 Filling (The Actual Concept): Layered Defense in Depth
- What: Many safety layers at once.
- How: Policies + classifier + product features + monitoring.
- Why: If one layer misses, another catches.
🍞 Bottom Bread (Anchor): Even if chat filtering misses something, parental controls and reporting still help.
In short, the core idea turns high-level teen safety goals into precise, reusable prompts that safety models can act on—raising the safety floor while keeping room for innovation.
03 Methodology
At a high level: Input content → Policy Prompts → gpt-oss-safeguard Classification → Safety Action (allow, block, or safe alternative) → Logging & Review → Continuous Improvement.
Step-by-step (with the Sandwich Pattern where new core concepts first appear):
- Prepare Policy Prompts
- What happens: Developers load the teen safety prompts that define categories like graphic violence, graphic sexual content, harmful body ideals/behaviors, dangerous activities/challenges, romantic or violent roleplay, and age-restricted goods/services. Each prompt includes definitions, edge cases, and decision rules.
- Why this step exists: Without precise prompts, the model may be inconsistent or over/under-block.
- Example: A prompt might say, “If the user self-identifies as under 18 and asks about buying alcohol, classify as Age-Restricted Goods and suggest legal and health info instead of instructions.”
- Ingest Content (messages, posts, captions)
- What happens: The user’s message arrives with basic context (e.g., language, app mode, optional age estimate from a separate system).
- Why: The model needs the actual content and minimal context to reason safely.
- Example: “I’m 15. Is it safe to try the blackout challenge I saw on TikTok?”
- Apply gpt-oss-safeguard with the Policy Prompts
- What happens: The safety model reads the prompts, evaluates the content, and outputs a label (e.g., Dangerous Activities) plus a short rationale and confidence.
- Why: This is the heart of consistent enforcement.
- Example output: Category = Dangerous Activities; Rationale = Encourages a fainting stunt; Confidence = High.
- Decide the Safety Action (Content Filtering)
- What happens: Based on the label and confidence, the system chooses: allow as-is, block, or transform the response.
- Why: Filtering translates classification into user-facing protection.
- Example: For the blackout challenge, the system blocks any how-to and instead returns a safety warning and evidence-based health info.
- Generate a Teen-Appropriate Response (Age-appropriate safeguards)
- What happens: If blocked or transformed, the app explains why in a friendly, non-judgmental way and offers helpful alternatives or resources.
- Why: Teens benefit from support and education, not just a hard “no.”
- Example: “I’m concerned because that challenge can cause serious harm. Want ideas for fun, safe challenges instead?”
- Real-time vs Offline Analysis
- What happens: For live chat, steps 2–5 occur instantly (Real-time Analysis). For moderation queues or audits, the same pipeline runs offline on batches of content.
- Why: Real-time protects immediately; offline improves quality over time and handles scale.
- Example: Live: decline unsafe roleplay now. Offline: review a day’s posts to find patterns and improve prompts.
- Logging, Review, and Feedback Loops (Layered Defense in Depth)
- What happens: The system logs decisions, anonymizes where appropriate, and samples cases for human review. Developers refine prompts and thresholds based on tricky cases and user feedback.
- Why: Layers and iteration reduce blind spots and drift.
- Example: If many borderline “fitness tips” appear, refine the harmful body ideals policy to clarify what is supportive vs. risky.
- Integration and Adaptation
- What happens: Teams adapt prompts to product context, add parental controls, transparency UIs, and reporting tools, and set different thresholds for teen vs. adult modes.
- Why: One size doesn’t fit all; apps differ in audience and features.
- Example: A school app sets stricter defaults than a general-purpose chatbot and adds educator dashboards.
The Secret Sauce: Policies as Prompts + Open Weights + Expert Input
- Policies as Prompts: Turning expert guidance into precise, testable instructions that models can follow consistently.
- Open Weights: Developers can run gpt-oss-safeguard locally or in their stack, improving trust, adaptability, and access.
- Expert Input: Partners like Common Sense Media and everyone.ai help define scope, structure, and edge cases, improving real-world fit.
Putting it all together, the methodology is a simple recipe: write clear rules for teens, give them to a safety model that can read and enforce them, decide how to act on the results in real-time or later, and keep improving with layered checks and community feedback.
04 Experiments & Results
The Test: What did they measure and why?
- This release focuses on providing policies and tooling rather than publishing new benchmark scores. The meaningful tests for such a system include: coverage (do policies address key teen risks?), consistency (are similar cases handled the same way?), helpfulness (do blocked responses guide users constructively?), and adaptability (can prompts be tuned for different apps and languages?). These align with the real goal: dependable, age-appropriate protection.
The Competition: Who/what is this compared against?
- Not a head-to-head with a single competitor, but an improvement over the common status quo: ad-hoc rules, adult-focused policies, or one-off classifiers that miss teen-specific nuances. The open, prompt-based approach invites community iteration, which many closed systems can’t match.
The Scoreboard (with context, not made-up numbers):
- Coverage: Policies explicitly address six high-impact teen risk areas (graphic violence, graphic sexual content, harmful body ideals/behaviors, dangerous activities/challenges, romantic or violent roleplay, and age-restricted goods/services). That’s like moving from a vague “be safe” sign to a clear, posted rule list for the top hazards.
- Consistency: Writing rules as prompts reduces “surprise calls.” It’s like a team sharing the same playbook so every player runs the same play under pressure.
- Helpfulness: Prompts encourage supportive, age-appropriate alternatives rather than just blocking. That’s the difference between a door slammed shut and a teacher who explains and offers resources.
- Adaptability: Because the prompts and model are open, developers can localize, extend, and tune. It’s like having a uniform you can tailor to fit while keeping the same safety standard.
Surprising Findings and Design Choices:
- Policies treat roleplay as a special case, recognizing that pretend scenarios can feel real and have real effects for teens. This is a nuanced area many systems overlook.
- The complementary behavioral-policy idea from everyone.ai (e.g., risks like exclusivity or overreliance) points beyond content-only checks, toward the longer-term effects of interactions, a broader direction for future evaluation.
- The framework invites both real-time filtering and offline audits, acknowledging that catching issues live is crucial, but learning from patterns later makes the system steadily safer.
Illustrative Walkthroughs (qualitative outcomes):
- Dangerous Challenge Example: Input: “How do I do the blackout challenge?” Outcome: Classified as Dangerous Activities → No instructions → Provide health risks and safe alternatives.
- Age-Restricted Goods Example: Input: “I’m 16, can I buy a vape pen?” Outcome: Classified as Age-Restricted Goods → Decline assistance → Explain laws/health risks and suggest quitting resources.
- Harmful Body Ideals Example: Input: “What’s the fastest way to lose 15 pounds in a week?” Outcome: Classified as Harmful Body Ideals/Behaviors → Reject unsafe advice → Offer healthy, age-appropriate wellness guidance and, if appropriate, support resources.
Because the release centers on policy and tooling, it does not present numeric benchmarks. Instead, it sets a shared foundation so future community tests can measure accuracy, safety, and helpfulness across languages and contexts.
05 Discussion & Limitations
Limitations (be specific):
- These prompts are a starting point, not a full guarantee of teen safety; they do not cover every risk or cultural nuance.
- Some edge cases are inherently hard (e.g., complex roleplay, sarcasm, coded language), and models can still misclassify.
- Policies focus on content; long-term behavioral risks (like overreliance) need complementary approaches.
- False positives can frustrate users; false negatives can be harmful—balancing them requires tuning and review.
Required Resources:
- Access to gpt-oss-safeguard (open weights), infrastructure to run it reliably and privately, and integration into app logic.
- Developer time to adapt prompts to context, plus human reviewers to audit tricky cases.
- Product features such as parental controls, reporting, transparency UIs, and monitoring.
When NOT to Use:
- Apps exclusively for verified adults (where teen policies may over-restrict and reduce utility).
- High-stakes clinical or legal advice settings that require licensed professional oversight beyond general safety prompts.
- Situations with strict privacy constraints but insufficient infrastructure to run models safely on-device or in-region.
Open Questions:
- How best to measure teen helpfulness across cultures and languages without exposing teens to harm during testing?
- What additional categories are most needed next (e.g., scams targeting teens, bullying/harassment nuances, or manipulative monetization)?
- How can systems detect and reduce long-term behavioral risks (exclusivity, overreliance) while preserving user autonomy?
- What’s the most effective mix of automation and human review for sustained, globally fair outcomes?
Overall, this release is a meaningful step—clearer, shareable rules plus an open safety model—but it works best as part of a layered safety program with continuous learning and community input.
06 Conclusion & Future Work
Three-Sentence Summary:
- This work turns teen safety guidance into precise, prompt-based policies that plug into gpt-oss-safeguard and similar models for consistent protection.
- By focusing on high-impact teen risks and encouraging layered defenses, it helps developers deliver safer, more supportive experiences in real time and at scale.
- Open sourcing the prompts and weights invites the community to adapt, translate, and improve them, raising the safety floor for everyone.
Main Achievement:
- Making teen safety operational: transforming expert advice into usable, shareable prompts that real systems can enforce today.
Future Directions:
- Expand categories (e.g., bullying nuance, scam detection), add behavioral risk policies, strengthen multilingual coverage, and refine teen-friendly response patterns.
- Develop community benchmarks for teen safety consistency, helpfulness, and fairness across cultures.
Why Remember This:
- Clear rules make safer AI. By writing policies as prompts—backed by an open safety model and expert input—developers finally have a practical, collaborative path to protect teens while keeping AI helpful and empowering.
Practical Applications
- Moderate teen chatbots to block dangerous challenges and provide safe alternatives instantly.
- Filter user-generated content in youth-focused forums, classrooms, or clubs for teen-appropriate material.
- Add a teen mode to general-purpose AI assistants with stricter defaults and supportive explanations.
- Screen creative writing apps to gently steer away from violent or sexual roleplay for under-18 users.
- Protect e-commerce assistants by declining help with age-restricted goods when users are teens.
- Audit past posts or comments in youth communities to find patterns and refine safety prompts.
- Localize policies for different regions and languages while keeping consistent safety categories.
- Enhance parental control dashboards with clear categories, logs, and explanations of blocked items.
- Support educators with classroom-safe AI that respects school policies and teen well-being.
- Integrate reporting tools and human review queues for tricky edge cases uncovered by the classifier.