
Introducing the Stateful Runtime Environment for Agents in Amazon Bedrock

Beginner
OpenAI Blog Ā· 2/27/2026

Key Summary

  • This paper introduces a Stateful Runtime Environment that runs inside Amazon Bedrock so AI agents can handle long, multi-step jobs safely and reliably.
  • Instead of forgetting between steps, agents keep a 'working context' that remembers history, tool outputs, and permissions so they don’t get lost.
  • It reduces the heavy lifting for developers by handling orchestration, errors, approvals, and restarts inside the customer’s AWS environment.
  • Compared to stateless APIs, this approach fits real business work that spans many tools, needs audits, and must follow security rules.
  • The runtime is optimized for AWS services and makes it easier to meet governance and compliance needs.
  • Teams can ship customer support, sales ops, IT automation, and finance workflows to production faster.
  • There are no public benchmarks yet, but the method lays out how to measure reliability, cost, and safety once available.
  • Limitations include AWS dependence, unknown performance numbers, and the need to define guardrails and approvals well.
  • Overall, it’s a shift from quick demos to dependable, long-running agents suitable for real companies.

Why This Research Matters

Real businesses need AI that can do more than answer one-off questions—they need dependable helpers that finish entire jobs across days or weeks. A stateful runtime means agents remember what happened, follow rules, and ask for permission when actions are risky. Running inside Amazon Bedrock lets companies use their existing AWS security and governance, which lowers risk and speeds up adoption. This can shorten support wait times, reduce IT backlogs, and make finance approvals cleaner and auditable. By reducing fragile glue code, teams can focus on the business logic that makes them unique. The result is faster time to production with safer, more traceable automation.

Detailed Explanation


01 Background & Problem Definition

You know how building a cool LEGO model at home is easy, but making thousands of the same model in a factory—on time, safely, and with quality checks—is much harder? Demos are fun; production is serious work.

šŸž Hook: Imagine you ask a helper to plan your birthday party. If the helper forgets what cake you like every time you talk, you’ll keep repeating yourself. 🄬 The Concept (Stateless API): A stateless API is a way of talking to an AI where each message is treated like the first time you ever met.

  • How it works: 1) You send a prompt. 2) It answers. 3) It forgets almost everything right after. 4) If you want it to remember, you must resend context each time.
  • Why it matters: Without memory, multi-step jobs break because the AI loses track and you must glue everything together manually. šŸž Anchor: Asking ā€œWhat’s 2+2?ā€ once is fine. But planning a vacation over a week with many changes? Stateless gets messy quickly.
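The "resend context each time" problem is easy to see in code. Here is a minimal sketch (the model call is a stand-in, not a real API): because the server keeps nothing between requests, the caller must ship the full history on every call.

```python
# Hypothetical sketch: with a stateless API, the caller must resend the
# entire conversation history on every request, or the model "forgets".
def fake_model(messages):
    # Toy model: it only "knows" what is inside this one request.
    return f"I see {len(messages)} messages of context."

def stateless_call(history, user_msg):
    # Each request carries the FULL history; the server stores nothing.
    request = history + [{"role": "user", "content": user_msg}]
    reply = fake_model(request)
    return request + [{"role": "assistant", "content": reply}]

history = []
history = stateless_call(history, "Plan my trip to Kyoto.")
history = stateless_call(history, "Add a day in Osaka.")
print(history[-1]["content"])  # "I see 3 messages of context."
```

Notice that the caller, not the service, owns all the memory. That is exactly the glue code a stateful runtime is meant to absorb.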

Before this work, many AI agents were great at thinking for one turn—answer a question, call a single tool, done. But real business work isn’t one-and-done. It’s more like a scavenger hunt across systems: check a database, create a ticket, email a manager, wait for approval, update records, and report what happened. Each step depends on the last. If the helper forgets, or mixes up who’s allowed to do what, things fall apart.

šŸž Hook: You know how a chef keeps a recipe card, preps ingredients, and remembers which pan is hot? 🄬 The Concept (Stateful Runtime Environment): A Stateful Runtime Environment is a place where agents can work while remembering what happened before, what tools they used, and what they’re allowed to do.

  • How it works: 1) It stores the ā€œworking contextā€ (history, tool results, decisions). 2) It organizes steps in order. 3) It keeps identity and permissions attached. 4) It can pause, resume, retry, and audit.
  • Why it matters: Without it, teams must build custom glue code for memory, errors, and approvals, which is fragile and slow. šŸž Anchor: It’s like a kitchen that keeps your recipe, the chopped veggies, the timers, and who’s allowed to use the sharp knives—all in one place.
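A working context can be pictured as a small record that travels with the job. This sketch uses invented names (it is not the Bedrock API) just to show the shape: history, tool results, and permissions live together and persist between steps.

```python
from dataclasses import dataclass, field

# Minimal sketch (illustrative names, not the Bedrock API): a working
# context that persists between steps instead of being rebuilt each call.
@dataclass
class WorkingContext:
    job_id: str
    history: list = field(default_factory=list)       # past steps, in order
    tool_results: dict = field(default_factory=dict)  # outputs keyed by step
    permissions: set = field(default_factory=set)     # what this job may do

    def record(self, step, result):
        self.history.append(step)
        self.tool_results[step] = result

ctx = WorkingContext(job_id="T-4321", permissions={"refund:limited"})
ctx.record("verify_order", {"order": 98765, "status": "paid"})
ctx.record("compute_refund", {"amount": 120.00})
print(ctx.history)  # the agent can see everything it has already done
```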

The problem researchers faced was operational, not just intellectual: How do we run multi-step work reliably over time, across real tools, with the right controls? Earlier attempts chained stateless calls, added ad‑hoc memory logs, or used generic workflow tools. These helped, but integrations were brittle, context got out of sync, and security/audit needs were bolted on late.

šŸž Hook: Think about building a Rube Goldberg machine—lots of parts in a row, each depending on the last. 🄬 The Concept (Multi-step Workflow): A multi-step workflow is a job completed through several ordered tasks, where each step may depend on earlier results.

  • How it works: 1) Plan steps. 2) Execute a step. 3) Save what happened. 4) Decide the next step using saved info. 5) Repeat until done.
  • Why it matters: If steps run out of order or forget earlier results, the final outcome can be wrong. šŸž Anchor: Returning an online order might involve verifying identity, checking inventory, issuing a refund, and emailing a receipt—in that order.
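The five-step loop above can be sketched in a few lines. The step functions here are hypothetical stand-ins; the point is that each step writes into shared state, and later steps read from it—which is why order matters.

```python
# Illustrative sketch of the loop: plan, execute, save, decide next
# using saved info, repeat. Step functions are invented for this example.
def run_workflow(steps, state=None):
    state = state or {}
    for name, fn in steps:       # 1) the plan is an ordered list of steps
        state[name] = fn(state)  # 2)-3) execute the step and save its result
    return state                 # 4)-5) every step saw the earlier results

steps = [
    ("verify_identity", lambda s: "ok"),
    ("check_inventory", lambda s: 3),
    # This step depends on an earlier result, which is why order matters:
    ("issue_refund", lambda s: 120.0 if s["verify_identity"] == "ok" else 0.0),
]
print(run_workflow(steps))
```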

šŸž Hook: A band sounds great when someone leads them. 🄬 The Concept (Orchestration Layer): The orchestration layer is the ā€œconductorā€ that schedules tasks, calls tools, handles timeouts, and decides what to do next.

  • How it works: 1) Reads plan. 2) Calls tools with the right inputs. 3) Waits or retries. 4) Records progress. 5) Moves to the next step.
  • Why it matters: Without it, steps collide, tools get wrong inputs, and errors pile up. šŸž Anchor: Like a conductor cueing violins, then drums, then flutes, so the song makes sense.

šŸž Hook: Teachers keep notes so they remember each student’s progress. 🄬 The Concept (Context Management): Context management keeps track of history, tool outputs, decisions, and references so the agent doesn’t lose the thread.

  • How it works: 1) Capture each step’s inputs/outputs. 2) Summarize or index them. 3) Provide the right slice of context to the next step. 4) Archive for audit.
  • Why it matters: Without good context, the agent repeats work, contradicts itself, or uses stale info. šŸž Anchor: If you forgot you already did your homework, you might do it twice—or not at all.

šŸž Hook: Seatbelts and speed limits keep car rides safe. 🄬 The Concept (Guardrails in Secure Environments): Guardrails are rules and protections that keep actions safe, allowed, and traceable.

  • How it works: 1) Check permissions. 2) Enforce policies. 3) Require approvals for risky steps. 4) Log everything.
  • Why it matters: Without guardrails, agents could leak data, overspend, or make unauthorized changes. šŸž Anchor: Like a balcony rail that lets you enjoy the view without falling.
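A guardrail check can be as simple as a gate that runs before every action. This is a hedged sketch—the policy table and action names are invented, not Bedrock's—showing the three outcomes: deny, allow, or escalate to a human.

```python
# Hedged sketch of a pre-execution guardrail check; policy names are
# illustrative, not taken from Amazon Bedrock.
POLICIES = {
    "refund": {"max_auto_amount": 100.00},  # refunds above this need approval
}

def check_guardrails(action, amount, permissions):
    if action not in permissions:
        return "deny"                        # 1) permission check
    limit = POLICIES[action]["max_auto_amount"]
    if amount > limit:
        return "needs_approval"              # 3) risky steps go to a human
    return "allow"                           # 2) policy satisfied, proceed

print(check_guardrails("refund", 120.00, {"refund"}))  # needs_approval
print(check_guardrails("refund", 40.00, {"refund"}))   # allow
print(check_guardrails("refund", 40.00, set()))        # deny
```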

šŸž Hook: Think of Amazon Bedrock as a mall where many helpful AI stores live under one roof. 🄬 The Concept (Amazon Bedrock): Amazon Bedrock is an AWS service that hosts AI models and tooling so companies can build AI apps within their own cloud environment.

  • How it works: 1) Choose models. 2) Connect tools. 3) Run agents in your AWS setup. 4) Use AWS security and monitoring.
  • Why it matters: Building inside your own cloud helps meet security and governance needs. šŸž Anchor: It’s like opening your store inside a safe, well-managed mall instead of a random street corner.

The gap this paper fills is unifying memory, orchestration, identity, and guardrails into a single, AWS‑native runtime so agents can do long, real jobs. The stakes are big: faster customer help, cleaner approvals in finance, fewer IT tickets, and lower risk because every action is monitored and governed.

02 Core Idea

The ā€œaha!ā€ moment: Move from short, stateless chats to a stateful, governed runtime that lives inside your AWS environment, so agents can remember, orchestrate, and safely finish real multi-step work over time.

Explain it three different ways:

  1. Kitchen analogy: Before, the chef had to fetch the recipe, re-chop onions, and reheat pans every time you asked for a step. Now, the kitchen keeps the recipe, prepared ingredients, hot pans, and safety rules ready between steps.
  2. Field trip analogy: Before, each permission slip and headcount was redone at every bus stop. Now, a clipboard tracks who’s on board, where you’ve been, the next stop, and which stops need a chaperone’s approval.
  3. Relay race analogy: Before, runners dropped the baton between legs. Now, the baton (state) is securely handed off, with a coach (orchestrator) calling plays and a referee (guardrails) ensuring fair rules.

Before vs After:

  • Before: Chain prompts, hope tools cooperate, build custom logs, add manual approvals, and pray it resumes after a crash.
  • After: The runtime stores the working context, sequences steps, enforces permissions, pauses/resumes safely, and records an audit trail—inside your AWS environment.

šŸž Hook: You know how a travel folder holds tickets, hotel bookings, and IDs all in one place? 🄬 The Concept (Working Context): Working context is the bundle of memory/history, tool results, decisions, identities, and environment details an agent carries from step to step.

  • How it works: 1) Capture inputs/outputs each step. 2) Attach identity/permissions. 3) Version and checkpoint. 4) Retrieve only what’s needed next.
  • Why it matters: Without a solid working context, agents re-ask, misuse tools, or violate policies because they lack the right info at the right time. šŸž Anchor: It’s like a trip wallet with your passport, tickets, and hotel vouchers—you move smoothly through checkpoints.
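The "version and checkpoint" step can be made concrete with a small sketch. The file-based store here is purely illustrative (a real runtime would use a managed state store), but it shows the idea: write the context down at each checkpoint so a crashed job resumes instead of restarting.

```python
import json
import os
import tempfile

# Sketch: checkpoint the working context so a crashed job can resume.
# The on-disk JSON store is illustrative only, not the real mechanism.
def checkpoint(path, state, version):
    versioned = dict(state, _version=version)
    with open(path, "w") as f:
        json.dump(versioned, f)

def resume(path):
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "job-T-4321.json")
checkpoint(path, {"last_step": "refund_issued", "txn": "TXN-55A1"}, version=3)

state = resume(path)  # after a crash, pick up exactly where we left off
print(state["last_step"], state["_version"])
```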

Why it works (intuition, not equations):

  • Treating state as a first-class citizen lets planning and execution feed each other reliably.
  • Identity-aware steps ensure the right permissions apply every time, preventing accidental overreach.
  • Integrated guardrails and approvals turn risky operations into supervised ones.
  • Checkpoints and retries tame long-horizon tasks that cross time boundaries and failures.
  • Running within AWS governance aligns with how enterprises already secure data and tools.

Building blocks (the idea in smaller pieces):

  • Memory/State store: Saves checkpoints so you can pause/resume and audit.
  • Tool invocation manager: Calls external systems with timeouts, retries, and typed inputs/outputs.
  • Planner/executor: Decides next steps and executes them in order.
  • Identity & permissions: Keeps actions within user or service boundaries.
  • Guardrails & approvals: Policies that block, allow, or require humans-in-the-loop.
  • Audit & observability: Logs every decision and effect.
  • AWS-native deployment: Lives where your data and controls already are.

šŸž Hook: Like a school ID card that opens certain doors but not others. 🄬 The Concept (Identity/Permission Boundaries): These boundaries define what the agent can and cannot do, based on who it acts for.

  • How it works: 1) Bind a user or role to a session. 2) Check policies per action. 3) Escalate when needed. 4) Log decisions.
  • Why it matters: Without boundaries, a simple request could trigger unauthorized or costly actions. šŸž Anchor: A cafeteria pass gets you lunch, not the principal’s office keys.
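Binding actions to a role is a lookup before every call. The role and action names below are invented for illustration (they echo the `support-refund-limited` role used later in the walkthrough), but the shape is the point: the session carries a role, and each action is checked against it.

```python
# Illustrative sketch: bind a role to a session and check every action
# against it. Role and action names are invented for this example.
ROLE_ACTIONS = {
    "support-refund-limited": {"read_order", "issue_refund"},
    "support-readonly": {"read_order"},
}

def allowed(role, action):
    # Unknown roles get an empty set, so they are denied everything.
    return action in ROLE_ACTIONS.get(role, set())

print(allowed("support-refund-limited", "issue_refund"))  # True
print(allowed("support-readonly", "issue_refund"))        # False
```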

šŸž Hook: Rules that say, ā€œMeasure twice, cut once.ā€ 🄬 The Concept (Approvals and Audits): Approvals require a person to okay risky steps; audits record what happened for later review.

  • How it works: 1) Tag sensitive actions. 2) Pause and notify approvers. 3) Record who approved what, when, and why. 4) Resume with traceability.
  • Why it matters: Without this, you can’t prove compliance or learn from mistakes. šŸž Anchor: Like a permission slip signed before the field trip bus departs.
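The four steps above—tag, pause, record, resume—fit in a short sketch. Everything here is illustrative (a real system would notify the approver asynchronously and persist the log), but it shows why approvals and audits belong together: the decision and its effect land in the same record.

```python
from datetime import datetime, timezone

audit_log = []

# Sketch of the approval loop: pause on a tagged action, record the
# decision with a timestamp, then resume. Names are illustrative.
def request_approval(action, approver, decision):
    audit_log.append({
        "action": action,
        "approver": approver,
        "approved": decision,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return decision

if request_approval("refund $120 for order 98765", "Bob", decision=True):
    audit_log.append({"action": "refund issued", "txn": "TXN-55A1"})

print(len(audit_log))  # both the approval and its effect are on record
```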

03 Methodology

At a high level: Request → Initialize Working Context → Plan → Call Tools (with guardrails) → Checkpoint & Log → Human Approval (if needed) → Resume/Retry/Wait → Finalize & Report.

Step-by-step, like a recipe:

  1. Input the request
  • What happens: A user or system asks the agent to do something, e.g., ā€œRefund order 98765 and notify the customer.ā€ The request arrives with identity (who’s asking) and any starting data.
  • Why this exists: Without a clear entry and identity, the system can’t apply the right permissions or track who owns the result.
  • Example: User Alice from Support submits ticket T-4321 to process a $120 refund.
  2. Initialize the Working Context
  • What happens: The runtime opens or creates a state record for this job, loading history, related tickets, relevant policies, and user/role information. It attaches a unique job ID and timestamps.
  • Why this exists: Without a single source of truth, the agent could duplicate steps, miss policies, or mix up identities.
  • Example: Context includes prior chat with the customer, last payment attempt result, and Alice’s support-role permissions.
  3. Plan the workflow
  • What happens: The agent (guided by the runtime) breaks the job into steps: verify order, check payment status, compute refund, seek approval if >$100, issue refund, email receipt, update CRM.
  • Why this exists: A plan prevents random tool calls and keeps the process predictable and auditable.
  • Example: The plan marks ā€œapproval requiredā€ because $120 exceeds the threshold.
  4. Guardrails and policy check
  • What happens: Before executing each step, the runtime evaluates policies: data access limits, spending caps, PII handling rules, and whether human approval is needed.
  • Why this exists: Without pre-checks, the agent might do forbidden or high-risk actions.
  • Example: Policy says any refund >$100 needs a manager’s sign-off and sensitive data must be masked in logs.
  5. Tool selection and invocation
  • What happens: The orchestration layer calls the right tools with structured inputs, sets timeouts/retries, and validates outputs. It may call payment APIs, order databases, email services, or ticketing systems.
  • Why this exists: Direct, unmanaged tool calls cause brittle failures and inconsistent data.
  • Example: Call Payments.refund(order=98765, amount=120.00); on timeout, retry with backoff up to 3 times.
  6. Context management and checkpointing
  • What happens: Each step’s inputs/outputs are saved. The runtime creates a checkpoint so the job can resume from here if interrupted.
  • Why this exists: Without checkpoints, a crash means starting over and risking double charges or missing steps.
  • Example: After a successful refund, store transaction ID TXN-55A1 and the exact timestamp.
  7. Human-in-the-loop approvals (conditional)
  • What happens: If a step is tagged as risky, the runtime pauses, notifies an approver, and waits. The context includes what’s being approved and why.
  • Why this exists: Removes the need for manual side-channels and ensures oversight for sensitive actions.
  • Example: Manager Bob receives a prompt: ā€œApprove $120 refund for order 98765? Evidence: Payment cleared, customer complaint severity high.ā€
  8. Resume, retry, or rollback
  • What happens: After approval or on error, the runtime resumes at the latest checkpoint. It retries transient failures, or rolls back where possible.
  • Why this exists: Real systems fail—networks hiccup, APIs timeout. Smart recovery keeps work moving safely.
  • Example: If email service fails, retry with backoff or switch to a backup channel; don’t re-issue the refund.
  9. Identity and permission enforcement throughout
  • What happens: Each action runs under the correct identity/role. Escalations are explicit and logged.
  • Why this exists: Prevents accidental over-permissioned calls that could leak data or overspend.
  • Example: Tool calls to payments run under the ā€œsupport-refund-limitedā€ role; accessing PII requires masked views.
  10. Observability and audit logging
  • What happens: The runtime records a tamper-resistant log: who/what acted, inputs/outputs (appropriately redacted), approvals, errors, and timing.
  • Why this exists: Without traceability, you can’t prove compliance, debug issues, or improve the workflow.
  • Example: Log shows: Alice initiated, Bob approved, refund TXN-55A1 issued at 14:03:12Z, email sent at 14:04:09Z.
  11. Finalize and report
  • What happens: The agent compiles a final summary and artifacts (receipts, CRM notes) and closes the job.
  • Why this exists: Stakeholders need a clear outcome and evidence.
  • Example: Customer gets a receipt; CRM and ticket T-4321 are updated with the resolution.
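The retry-with-backoff behavior from the "Resume, retry, or rollback" step of the recipe above can be sketched directly. The flaky payments call is a stand-in for a real API; note that blind retries are only safe for idempotent actions (the walkthrough's "don't re-issue the refund" rule).

```python
import time

# Sketch of retry-with-backoff for transient failures. The payments call
# is a stand-in; only retry actions that are safe to repeat (idempotent).
def call_with_retry(fn, attempts=3, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise                              # out of retries: surface it
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

calls = {"n": 0}
def flaky_refund():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("payments API timed out")
    return "TXN-55A1"

print(call_with_retry(flaky_refund))  # succeeds on the third attempt
```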

AWS-native deployment and governance

  • Operate inside your AWS environment so existing security posture, monitoring, and integrations apply. This reduces the need to move data elsewhere and aligns with enterprise governance.

The secret sauce

  • State as a first-class citizen: Memory, identity, and checkpoints are built-in, not bolted on.
  • Identity-aware orchestration: Every tool call knows who it’s acting as and what policies apply.
  • Integrated guardrails: Approvals, limits, and redaction happen automatically, not as afterthoughts.
  • Long-horizon resilience: Pause/resume, retries, and idempotency prevent duplicate or missing actions.
  • AWS alignment: Works with customers’ existing cloud controls and tools, reducing friction to production.
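The idempotency piece of "long-horizon resilience" deserves a concrete sketch. The in-memory store below is illustrative only (a real runtime would persist keys durably), but it shows how a resumed job replays a completed action instead of repeating it—no double refund after a crash.

```python
# Sketch: idempotency keys prevent a resumed job from issuing the same
# refund twice. The in-memory store is illustrative; real systems persist it.
completed = {}

def issue_refund_once(idempotency_key, amount):
    if idempotency_key in completed:
        return completed[idempotency_key]  # replay returns the prior result
    txn = f"TXN-{len(completed) + 1:04d}"  # pretend we called the payments API
    completed[idempotency_key] = txn
    return txn

first = issue_refund_once("order-98765-refund", 120.00)
again = issue_refund_once("order-98765-refund", 120.00)  # crash + resume
print(first == again)  # True: the customer is refunded exactly once
```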

04 Experiments & Results

What did they test? The paper is an introduction and does not publish formal benchmarks. Still, here’s how meaningful tests would look once available:

  1. Reliability of completion
  • Measure: Percentage of workflows that finish successfully without human help (for low-risk flows) and with proper approvals (for high-risk flows).
  • Why: Shows whether orchestration, retries, and checkpoints really reduce failures.
  2. Error recovery and resume success
  • Measure: How often a run recovers after a tool error or timeout, and at which checkpoint it resumes.
  • Why: Long-horizon tasks must survive real-world hiccups.
  3. Governance quality
  • Measure: Policy violations prevented, correctness of approval enforcement, and audit completeness scores (e.g., all sensitive steps logged with redaction).
  • Why: Proves guardrails are working as intended.
  4. Efficiency and cost
  • Measure: Average steps to completion, time-to-resolution, tool-call latency, and cost per workflow.
  • Why: Production systems must be economical and timely.
  5. Developer productivity
  • Measure: Time from idea to production, lines of glue code removed, incidents per quarter.
  • Why: The runtime’s promise is faster, safer shipping.

The competition (baselines)

  • Stateless chaining: Prompt → Tool → Prompt, with ad-hoc memory and manual retries.
  • Generic workflow engines without stateful agent support: Good at timers and branches, weaker at LLM context and tool semantics.
  • Custom in-house runtimes: Can be strong but costly to build and maintain.

Scoreboard with context (no fabricated numbers)

  • Think of an ā€˜A+’ as: High completion rates, clean audits, few policy violations, and graceful recovery. Many current stateless setups score closer to ā€˜B-’: they work for demos but need lots of babysitting in production. The Stateful Runtime aims to push these marks into the ā€˜A’ range by making memory, guardrails, and orchestration native features.

Surprising or notable findings to watch for

  • Checkpoint granularity: Fewer, bigger checkpoints are faster but risk more rework; more, smaller checkpoints add safety but increase overhead.
  • Human-in-the-loop placement: Early approvals can block wasteful work; late approvals can reduce back-and-forth—finding the right spot matters.
  • Tool heterogeneity: Diverse tools with inconsistent schemas stress the tool invocation manager; adapters and validation become critical.
  • Cost dynamics: Long-running state and logs add storage/compute costs, but reduced incidents and support time may outweigh them.
  • Summarization drift: As context is summarized over time, accuracy can drift; policies for refreshing ground-truth matter.

Because the paper doesn’t include public metrics, treat these as a rubric to evaluate pilots rather than results already claimed.

05 Discussion & Limitations

Limitations

  • AWS dependence: It’s designed to run inside customers’ AWS environments. If you’re multi-cloud or on-prem only, integration may be limited.
  • No published benchmarks yet: Reliability, latency, and cost improvements are not quantified publicly.
  • Governance complexity: Defining good policies and approvals is hard; weak rules can either block progress or allow risky actions.
  • Long-horizon cost/complexity: Storing state, logs, and checkpoints over time incurs costs and requires lifecycle policies.
  • Tool variability: Inconsistent tool APIs and data quality can still cause brittle edges, even with a strong runtime.

Required resources

  • An AWS environment with access to Amazon Bedrock.
  • Defined identities/roles and policies (who can do what).
  • Integrations to business tools (payments, CRM, ticketing, email, databases).
  • Observability setup to watch logs, metrics, and alerts.
  • A cross-functional team (engineering, security, compliance) to encode guardrails and approvals.

When NOT to use it

  • One-shot Q&A or simple lookups where stateless calls are cheaper and simpler.
  • Ultra-low-latency microservices needing strict determinism and millisecond SLAs.
  • Workflows that cannot run in AWS due to data residency or vendor constraints.
  • Early ideation where speed of a quick prototype matters more than governance.

Open questions

  • Portability: Can state and policies be ported across clouds or vendors without rework?
  • Formal guarantees: How far can we go toward verifiable safety for critical actions?
  • Scaling limits: How big can working context grow before performance degrades, and what summarization strategies work best?
  • Pricing and TCO: What’s the long-term cost profile versus DIY orchestration?
  • Developer ergonomics: What abstractions make approvals, retries, and tool schemas easy but safe?
  • Interoperability: How smoothly can it integrate with diverse enterprise tools and data contracts?

06 Conclusion & Future Work

In three sentences: This paper introduces a Stateful Runtime Environment that runs natively in Amazon Bedrock so agents can remember, orchestrate, and safely complete complex, multi-step work. It shifts focus from demo-grade, stateless prompts to production-grade workflows with built-in state, guardrails, and governance. Although public benchmarks aren’t provided yet, the design targets reliability, compliance, and faster time to value for real businesses.

The main achievement is unifying working context, orchestration, identity, and guardrails into a single runtime aligned with AWS infrastructure, reducing the custom glue code teams usually build.

Looking ahead, we can expect measured evaluations of reliability and cost, richer guardrail languages, better portability, and tools that make building approved workflows as simple as writing a checklist. If you remember one thing, remember this: making state a first-class citizen—inside a governed environment—is the key that turns clever agents into trustworthy coworkers.

Practical Applications

  • Multi-system customer support that verifies orders, issues refunds, and updates CRM with audit logs.
  • Sales operations that generate quotes, check inventory, request discounts with approval, and send contracts.
  • IT automation that resets passwords, provisions accounts, and opens tickets with safe permission boundaries.
  • Finance workflows that route invoices, validate spend, request approvals, and post to accounting systems.
  • HR onboarding that creates accounts, ships equipment, assigns training, and confirms completion.
  • Procurement that compares vendors, raises purchase orders, and enforces budget policies.
  • Supply chain exception handling that reroutes shipments and notifies stakeholders with full traceability.
  • Marketing campaign setup that drafts assets, schedules posts, gets legal sign-off, and publishes.
  • DevOps runbooks that roll back deployments, rotate keys, and notify teams with guardrails.
  • Healthcare admin tasks like eligibility checks and prior authorizations with required approvals and logs.
Tags: stateful runtime Ā· AI agents Ā· Amazon Bedrock Ā· orchestration Ā· context management Ā· AWS governance Ā· guardrails Ā· approvals and audits Ā· identity and permissions Ā· long-running workflows Ā· tool invocation Ā· reliability Ā· multi-step workflows Ā· productionization Ā· agentic workflows