Rakuten fixes issues twice as fast with Codex
Key Summary
- Rakuten plugged OpenAI's Codex (a coding agent) into daily engineering work to fix problems faster, ship code more safely, and build bigger features with less hand-holding.
- They cut the time it takes to recover from outages roughly in half: problems that used to take hours now take about half as long to fix.
- Codex helps write and tune KQL queries to spot root causes quickly, so engineers spend more time fixing and less time hunting.
- In the CI/CD pipeline, Codex performs automatic code reviews and security checks using Rakuten's own internal rules, so quality doesn't drop as speed increases.
- For larger, fuzzy projects, Codex can move from a rough spec to working apps (backend plus iOS frontend), shrinking quarter-long builds into weeks.
- Engineers shift from writing every line to defining clear goals and verifying results, which scales impact across 30,000 employees.
- This approach blends speed, safety, and autonomy into one system, so improvements compound instead of colliding.
- The main lesson: an AI agent that can read logs, review code, and build features end-to-end can remove long-standing bottlenecks in modern software teams.
Why This Research Matters
When websites or apps break, every minute counts for users and businesses. By cutting recovery time roughly in half, customers can finish checkouts, transfers, or bookings instead of giving up. Automatic reviews and security checks mean fewer scary surprises after release, protecting data and trust. Turning rough ideas into working software faster helps teams respond to market changes, support new devices, and help more users. And shifting engineers from typing every line to defining goals and verifying results scales their impact across big organizations. This blend of speed, safety, and autonomy sets a new normal for how modern software is built and maintained.
Detailed Explanation
01 Background & Problem Definition
You know how a school has different teams (teachers, coaches, nurses) all trying to help students learn safely and quickly? Big tech companies work the same way with their software: many teams must keep things fast and safe at the same time. Before this work, teams often had to choose: hurry up and risk mistakes, or slow down to be extra careful. That tug-of-war wasted time and energy.
Hook: Imagine you have a smart helper who can fix your bike, check it's safe, and even build a new scooter from your sketch, all without slowing you down. The Concept: Codex is that smart helper for software teams: an AI coding agent that can read, write, and review code, and help diagnose problems. How it works (simple steps):
- Understands goals and context (like reading a task list and house rules).
- Reads data (logs, code, tests) and chooses the right tools (like KQL for logs).
- Proposes changes or solutions and explains why.
- Verifies with tests and checks against company standards. Why it matters: Without a helper like this, teams get stuck switching between hunting for clues, fixing, and reviewing, which slows everything down. Anchor: A service breaks at 2 a.m.; Codex reads the error logs, pinpoints a bad database query, suggests a safer fix, and drafts the pull request so the on-call engineer can quickly verify and deploy.
The world before: Engineers did lots of manual steps. They wrote complicated queries to search logs, dug through dashboards, drafted patches, waited for human reviewers, and ran security scans. Each handoff added friction. If the original problem was fuzzy, people paused until someone wrote a perfect specification. That meant long recoveries during outages and slow progress on big projects.
Hook: You know how a sports team practices passing so the ball moves smoothly down the field? Software teams need a smooth path from writing code to giving it to users. The Concept: CI/CD (Continuous Integration/Continuous Deployment) is the automated assembly line that checks, tests, and delivers code to users. How it works:
- Developers share code often (integration).
- Automated tests run automatically.
- Approved changes roll out to users (deployment). Why it matters: Without CI/CD, code piles up, mistakes hide longer, and shipping slows to a crawl. Anchor: A student turns in homework every day instead of once a month; the teacher can correct mistakes quickly so learning speeds up.
Hook: When you lose a toy at home, you ask smart questions to find it. Logs are like the clues to your toy's location. The Concept: KQL (Kusto Query Language) is how engineers ask big log systems, "Where did things go wrong?" How it works:
- Point at the right data (like all errors in the last hour).
- Filter and group (which service failed most?).
- Spot patterns (spikes, time ranges, user impact). Why it matters: Without good questions, you chase the wrong clues and waste time. Anchor: "Show errors from the checkout service after 5 p.m." quickly reveals a sudden spike that broke payments.
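A query along these lines can be written directly in KQL. The sketch below is illustrative only: it assumes a hypothetical AppTraces table with Timestamp, Service, SeverityLevel, and Dependency columns, so adjust the names to your own log schema.

```kql
// Hypothetical table and column names; adapt to your log schema.
AppTraces
| where Timestamp > ago(1h)                       // errors in the last hour
| where Service == "checkout-service" and SeverityLevel >= 3
| summarize ErrorCount = count() by bin(Timestamp, 5m), Dependency
| order by ErrorCount desc
```

Grouping errors into 5-minute bins per dependency makes a sudden spike after a deploy stand out immediately.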
The problem: Even with CI/CD and KQL, people still stitched everything together by hand. During incidents, engineers juggled alerts, logs, and patches. In code review, humans became bottlenecks as changes piled up. For big, fuzzy projects, teams waited for perfect specs before moving, so work stalled.
Hook: When a fire alarm rings, you want firefighters who know exactly what to do, fast. The Concept: Incident response is the playbook for detecting, diagnosing, fixing, and learning from software problems quickly. How it works:
- Detect alerts and confirm impact.
- Diagnose the root cause.
- Remediate with a safe fix.
- Review what happened to prevent repeats. Why it matters: Without a clear response, outages last longer and users lose trust. Anchor: A login service fails; a runbook guides the team to roll back a bad change in minutes, not hours.
Hook: A car that gets regular tune-ups runs smoothly for years. The Concept: Site Reliability Engineering (SRE) keeps services reliable by mixing software engineering with operations. How it works:
- Define reliability goals (like target uptime).
- Measure reality with logs and metrics.
- Automate fixes and guardrails.
- Improve with post-incident learning. Why it matters: Without SRE, systems drift, outages grow, and user happiness drops. Anchor: SREs set an error budget and use it to decide when to ship features versus harden reliability.
Failed attempts: Teams tried adding more dashboards, more manual checklists, and more reviewers. This helped a bit, but created more to manage. People also tried rigid rules that required perfect specs before starting work: good for safety, bad for speed. The missing piece was an agent that could understand context, follow company rules, and act across logs, code, and tests.
Hook: Wouldn't it be great if a robot vacuum also checked the doors were locked and watered the plants, without you asking every detail? The Concept: Automation in software development lets tools do repeated, careful tasks reliably. How it works:
- Capture a repeatable task (like running tests).
- Encode rules and standards.
- Trigger it at the right time (like on each pull request).
- Log results so people can verify. Why it matters: Without automation, humans do boring, error-prone work and slow everything down. Anchor: Every time code changes, tests and linters run automatically so problems are caught early.
Hook: A security guard checks doors and windows every night. The Concept: Vulnerability checks scan code and configurations to find weaknesses attackers could exploit. How it works:
- Look for known risky patterns (like hardcoded secrets).
- Check dependencies for reported flaws.
- Suggest safer alternatives. Why it matters: Skip this, and a tiny mistake can become a big breach. Anchor: A pull request is blocked because it sends passwords over plain HTTP; the tool suggests HTTPS and a secure library.
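The pattern-checking part of a vulnerability scan can be approximated with a few regular expressions. This is a minimal sketch covering only hardcoded secrets and plain-HTTP URLs; a real scanner covers far more patterns and also checks dependencies.

```python
import re

# Simple risky-pattern rules: (name, compiled regex). Illustrative only;
# a production scanner has many more rules and scans dependencies too.
RULES = [
    ("hardcoded secret", re.compile(r"(password|api_key|secret)\s*=\s*['\"][^'\"]+['\"]", re.I)),
    ("plain HTTP URL", re.compile(r"http://", re.I)),
]

def scan(source: str) -> list:
    """Return (line_number, finding) pairs for risky patterns in source code."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in RULES:
            if pattern.search(line):
                findings.append((lineno, name))
    return findings

code = 'api_key = "sk-123"\nresp = fetch("http://example.com/login")\n'
print(scan(code))  # flags a hardcoded secret on line 1 and plain HTTP on line 2
```

In CI, a non-empty findings list would turn into blocking review comments on the pull request.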
The gap: Rakuten needed an AI agent that could plug into incident response (log querying and fixes), CI/CD (reviews and security), and development (moving from fuzzy specs to working software). It had to be fast but safe, and it had to learn Rakuten's own standards.
Real stakes: Outages can cost sales and trust. Slow reviews make good ideas miss the moment. Overly strict rules can freeze progress. By adding Codex as an always-on teammate, Rakuten aimed to cut recovery times, keep security high, and speed up projects, even when requirements aren't perfectly written yet.
02 Core Idea
The "Aha!" in one sentence: Treat an AI coding agent (Codex) as a dependable teammate that can read the clues (logs), fix the issue (code), and check the rules (standards) across the whole software life cycle, so speed, safety, and autonomy rise together instead of fighting each other.
Three analogies to feel it:
- Air traffic control + autopilot: Humans set the destination and rules; Codex keeps watch on instruments (logs), adjusts the course (code changes), and asks for help when weather (incidents) gets rough.
- Factory with smart stations: Every station (tests, security, review) has a helpful robot that knows company rules. Products (code) move faster with fewer defects.
- Detective + mechanic: Codex gathers clues (KQL queries), finds the culprit (bug), then fixes the engine (code) and runs a safety check before driving off (deployment).
Before vs After:
- Before: Manual log-diving, slow reviews, stalled projects needing perfect specs.
- After: Codex drafts queries, summarizes root causes, proposes compliant fixes, and turns partial specs into working backends and mobile apps, while CI/CD automatically enforces Rakuten's rules.
Why it works (intuition):
- Context fusion: The same agent sees logs, code diffs, and standards, so it connects dots that separate tools miss.
- Continuous verification: Suggestions are run through tests, security scanners, and style rules, catching mistakes early.
- Company brain: Feeding internal principles to Codex gives it a playbook, so "faster" doesn't mean "sloppier."
- Autonomy with guardrails: Codex moves projects forward but pauses for human sign-off where risk is higher.
Building blocks (each with a mini sandwich):
Hook: You know how a recipe includes both ingredients and house rules (like no peanuts)? The Concept: Internal standards are Rakuten's coding and security rules that Codex follows. How it works: Collected documents are given to Codex; it checks proposed code against them and flags mismatches. Why it matters: Without standards, speed can drift into unsafe territory. Anchor: A PR adding a new API is flagged for missing input validation per Rakuten's guideline.
Hook: When you're stuck on a puzzle, you try one move, see what happens, then adjust. The Concept: Plan-execute-verify loops let Codex propose, test, and refine changes. How it works: Draft fix → run tests/scans → compare to goals → iterate. Why it matters: Without verify loops, mistakes slip into production. Anchor: A slow SQL query is rewritten; tests confirm performance and correctness before merge.
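The plan-execute-verify loop fits in a few lines of code. This is a toy sketch: the proposer and verifier here are stand-in callables (all names are hypothetical), where in practice "verify" would run real tests and scans.

```python
def plan_execute_verify(propose, verify, max_iters=5):
    """Iterate: draft a candidate, run checks, refine until checks pass.

    propose(feedback) -> candidate; verify(candidate) -> (ok, feedback).
    """
    feedback = None
    for _ in range(max_iters):
        candidate = propose(feedback)       # draft a fix (the agent's move)
        ok, feedback = verify(candidate)    # run tests/scans against goals
        if ok:
            return candidate                # checks pass: ready for human review
    return None                             # escalate to a human after max_iters

# Toy example: grow a timeout value until it is large enough to pass the check.
def propose(feedback):
    return 1 if feedback is None else feedback + 1

def verify(timeout):
    return (timeout >= 3, timeout)

print(plan_execute_verify(propose, verify))  # 3
```

The key design choice is the bounded loop: the agent refines autonomously but hands off to a human instead of retrying forever.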
Hook: If you can both read a map and drive, you reach places faster. The Concept: Tool use means Codex invokes KQL, linters, test runners, and CI tasks directly. How it works: It selects a tool, crafts inputs, runs it, and reads outputs. Why it matters: Without tools, the agent guesses instead of measuring. Anchor: Codex runs a linter, sees a security warning, and corrects the code.
Together, these pieces flip the script: instead of waiting for perfect instructions, Codex moves first within guardrails, then humans verify, making throughput jump without losing safety.
03 Methodology
At a high level: Input (alerts, code changes, or a rough spec) → Codex understands context and company rules → Chooses tools (KQL, tests, scanners) → Proposes actions (queries, fixes, code) → Verifies against standards and tests → Produces outputs (root-cause summary, reviewed PR, or working app) → Human verifies and deploys.
We'll walk through three recipes: incident response, safe CI/CD, and autonomous builds.
Recipe 1: Compressing incident response
- What happens: An alert fires (say, checkout errors spike). Codex pulls related logs and metrics, drafts KQL queries, and summarizes patterns (which service, when it started, probable root cause). It proposes a minimal, reversible fix (feature flag change, rollback, or code patch) and drafts a PR with tests.
- Why this step exists: During incidents, the slowest part is figuring out what changed and why. Automating the early digging shortens the path to a fix.
- Example with actual data: Logs show repeated timeouts when calling Payments API from Checkout after 17:05. Codex drafts KQL: errors in checkout-service where dependency=payments and timestamp>17:00, grouped by version. It finds version 2.3.7 is failing most. It suggests rolling back checkout-service to 2.3.6 and adding a timeout+retry around the payments call, then opens a PR that adjusts a config flag and adds tests asserting retries.
- What breaks without it: Humans hand-craft queries, context-switch across tools, and take longer to prepare a safe roll-forward or rollback.
Steps in detail:
- Triage: Gather alerts, recent deploys, change logs. Codex correlates "what changed" with "when errors rose." Without this, you chase random leads.
- Query generation: Codex writes KQL to slice data by service, version, region. Without this, you might miss a regional or version-specific bug.
- Root-cause hypothesis: It explains likely causes and asks for confirmation. Without hypotheses, fixes become guesswork.
- Safe action plan: It proposes rollback/feature toggle/code patch plus validation steps. Without a plan, you risk making it worse.
- Draft PR + tests: It creates a minimal PR and unit/integration tests. Without tests, you can't verify the fix truly works.
- Verification: Run CI, canary deploy if needed, and watch metrics. Without verification, you ship blind.
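The triage step, correlating "what changed" with "when errors rose," can be sketched as a simple timestamp comparison. All data, field names, and the 30-minute window below are invented for illustration.

```python
from datetime import datetime, timedelta

# Invented sample data: recent deploys and the moment the error rate spiked.
deploys = [
    {"service": "checkout-service", "version": "2.3.7", "at": datetime(2024, 1, 5, 17, 2)},
    {"service": "search-service", "version": "9.1.0", "at": datetime(2024, 1, 5, 14, 30)},
]
error_spike_at = datetime(2024, 1, 5, 17, 5)

def suspects(deploys, spike_at, window=timedelta(minutes=30)):
    """Deploys that landed shortly before the spike are prime suspects."""
    return [d for d in deploys if timedelta(0) <= spike_at - d["at"] <= window]

for d in suspects(deploys, error_spike_at):
    print(f'{d["service"]} {d["version"]} deployed {error_spike_at - d["at"]} before the spike')
```

Here only checkout-service 2.3.7 falls inside the window, which is exactly the rollback candidate in the worked example above.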
Recipe 2: Building safer by invoking Codex in CI/CD
- What happens: On every pull request, Codex reviews diffs with Rakuten's rules, runs vulnerability checks, and writes concrete feedback (what's wrong, why, and suggested fix). It can auto-fix simple issues.
- Why this step exists: Human reviewers get tired and busy; automated checks keep quality consistent.
- Example with actual data: A PR adds a FastAPI endpoint that returns user data. Codex flags missing input validation, suggests pydantic models, and notes a logging statement prints PII. It proposes a sanitized logger and updates unit tests accordingly.
- What breaks without it: Security flaws and style drift sneak in; review queues grow.
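The PII-logging fix from the example might look like the following sketch: a sanitizer that masks likely PII fields before a record reaches the log. The field list is an assumption; in practice the policy would come from internal standards, not a hardcoded set.

```python
# Fields treated as PII in this sketch; a real policy would come from
# internal logging standards rather than a hardcoded list.
PII_FIELDS = {"email", "phone", "password", "name"}

def sanitize(record: dict) -> dict:
    """Return a copy of the log record with PII values masked."""
    return {k: ("***" if k in PII_FIELDS else v) for k, v in record.items()}

record = {"user_id": 42, "email": "a@example.com", "action": "login"}
print(sanitize(record))  # {'user_id': 42, 'email': '***', 'action': 'login'}
```

Routing every log call through a wrapper like this is the kind of mechanical, rule-driven change an agent can apply and test automatically.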
Steps in detail:
- Standards retrieval: Load internal guidelines (auth patterns, logging rules, dependency policies). Without this, advice is generic.
- Static analysis + pattern checks: Look for risky code paths, missing validation, weak crypto, or secrets. Without this, only surface issues are caught.
- Dependency scan: Check libraries for known CVEs and suggest upgrades. Without it, you ship known vulnerabilities.
- Context-aware suggestions: Align fixes to house style and architecture. Without context, changes won't match how the company builds.
- Auto-fix + comment: Apply safe refactors (e.g., escaping output), create review comments for higher-risk items. Without auto-fix, humans waste time on nits.
- Gatekeeping: Fail the check if high-risk items remain. Without gates, issues slip into production.
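The gatekeeping step reduces to a pure function over review findings. A sketch, assuming findings carry a simple severity label (the schema is hypothetical):

```python
def gate(findings):
    """Fail the CI check if any high-risk finding remains; pass otherwise.

    findings: list of dicts like {"severity": "high"|"medium"|"low", "msg": ...}.
    """
    blocking = [f for f in findings if f["severity"] == "high"]
    if blocking:
        # In CI this outcome would exit non-zero and block the merge.
        return False, [f["msg"] for f in blocking]
    return True, []

ok, reasons = gate([
    {"severity": "low", "msg": "minor style nit"},
    {"severity": "high", "msg": "password sent over plain HTTP"},
])
print(ok, reasons)  # False ['password sent over plain HTTP']
```

Keeping the gate a pure function makes the pass/fail decision easy to unit-test and audit, separate from how findings are produced.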
Recipe 3: Autonomy for larger, fuzzy projects
- What happens: Given a rough spec ("Make a mobile version of our web AI agent"), Codex asks clarifying questions, proposes an architecture, scaffolds code (Python/FastAPI backend, Swift/SwiftUI app), implements APIs, and connects the pieces. It writes tests and a small demo plan.
- Why this step exists: Perfect specs are rare; waiting for them stalls progress.
- Example with actual data: Spec says: endpoints for chat, history, and user auth. Codex designs /api/chat (POST), /api/history (GET), /api/login (POST), implements FastAPI routes with JWT auth, and writes a SwiftUI app that calls these endpoints. It adds rate limiting and input validation following Rakuten's rules.
- What breaks without it: Teams spend weeks drafting documents instead of building.
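The /api/chat contract above could be enforced with a small validator. This is a framework-free sketch; in the real build this would be a FastAPI route with a pydantic request model, and the field names and length limit here are assumptions.

```python
def validate_chat_request(payload: dict) -> list:
    """Return validation errors for a POST /api/chat body (empty list = valid)."""
    errors = []
    msg = payload.get("message")
    if not isinstance(msg, str) or not msg.strip():
        errors.append("message: required non-empty string")
    elif len(msg) > 4000:
        errors.append("message: exceeds 4000-character limit")
    if "token" not in payload:
        errors.append("token: missing auth token")
    return errors

def handle_chat(payload: dict):
    """Toy handler: 400 on invalid input, 200 with an echo reply otherwise."""
    errors = validate_chat_request(payload)
    if errors:
        return 400, {"errors": errors}
    return 200, {"reply": f"echo: {payload['message']}"}

print(handle_chat({"message": "hi", "token": "t"}))  # (200, {'reply': 'echo: hi'})
print(handle_chat({"message": ""}))                  # 400 with two errors
```

Writing the contract as a testable function first gives the agent (and reviewers) an acceptance test before any framework wiring exists.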
Steps in detail:
- Spec digestion: Turn the rough goal into user stories and acceptance tests. Without this, you can't measure done-ness.
- Architecture proposal: Choose stack and boundaries (e.g., FastAPI for backend, SwiftUI for iOS). Without boundaries, code becomes tangled.
- Scaffolding: Generate folders, modules, CI definitions, and sample tests. Without scaffolding, teams reinvent setup each time.
- Iterative implementation: Build endpoints/screens in small slices, run tests, and integrate. Without slicing, bugs pile up.
- Security and performance passes: Apply standard auth, rate limits, and basic caching. Without passes, you fix security late.
- Demo + docs: Create a minimal walkthrough and deployment notes. Without docs, handoffs hurt.
The secret sauce
- Company-tuned guardrails: Codex doesn't just write code; it enforces Rakuten's own rules so speed stays safe.
- Log-to-fix loop: By pairing KQL insights with code edits, Codex shortens the most painful part of outages.
- End-to-end autonomy: Moving from partial specs to running apps turns blank pages into shipping products fast, without waiting for perfect instructions.
- Human verification: Engineers still measure and approve, keeping accountability where it belongs.
04 Experiments & Results
The test: Rakuten focused on outcomes that matter in real life: how fast problems get fixed, how safely code ships, and how quickly big projects go from idea to reality.
What they measured and why
- Mean Time to Recovery (MTTR): Shorter MTTR means users wait less during outages.
- CI/CD safety and speed: How often reviews catch issues early and how much they relieve human bottlenecks.
- Autonomy in development: Can Codex turn partial specs into working systems without step-by-step instruction?
The competition (baselines)
- Human-only incident response: Engineers write queries, hunt root causes, and draft fixes manually.
- Traditional CI/CD: Standard linters and scanners without an AI reviewer tuned to company rules.
- Spec-first projects: Teams wait for perfect requirements before building.
The scoreboard with context
- MTTR: Rakuten reports about a 50% reduction. In math terms, the new recovery time equals the old one multiplied by one minus one half: T_new = T_old × (1 − 0.5). For example, if an outage used to take 6 hours to fix, 6 × (1 − 0.5) = 3 hours, so recovery drops from 6 hours to 3 hours.
- Project lead time: Quarter-long builds (roughly 12 weeks) shrink to weeks. One way to see this is as a quarter of the original time: T_new = T_old × 1/4. For example, if a project took 12 weeks before, 12 × 1/4 = 3 weeks, so delivery moves from 12 weeks to about 3 weeks in this example.
Make the numbers meaningful
- Cutting MTTR by half is like turning a long traffic jam into a much shorter one: users still notice, but fewer miss their flights.
- Shrinking a quarter into weeks is like finishing a school group project before midterms instead of cramming at the end, leaving more time for polishing and testing.
- Automated code review that uses house rules is like having a coach who knows your teamās exact playbook, not just generic tips; it prevents rework and keeps style consistent.
Surprising findings
- Partial specs are often "good enough" to start: Codex can infer missing details from patterns, company rules, and similar services, then ask clarifying questions only where needed.
- Review consistency improves: By applying the same internal standards every time, Codex reduces subjective back-and-forth on nits, freeing human reviewers for design and risk.
- Incident learning gets easier: Because Codex documents the queries, hypotheses, and fixes, post-incident reviews have better evidence, which helps prevent repeats.
Caveats about measurement
- MTTR varies by incident type; halving a small 20-minute issue matters less than halving a day-long outage. Still, across many incidents, the average drops substantially.
- "Quarters-to-weeks" depends on scope; greenfield mobile apps paired with existing backends benefit most, while deep platform changes may see smaller gains.
- Security findings depend on the maturity of existing checks; environments with weak baselines will see larger improvements when Codex is added.
05 Discussion & Limitations
Limitations
- Context gaps: If logs, runbooks, or standards are missing or outdated, Codex may suggest fixes that don't fit reality. High-quality inputs still matter.
- False positives/negatives: Automated reviews can over-flag harmless code or miss subtle design risks that require architectural judgment.
- Complex rollouts: Multi-region, multi-service deployments need careful coordination; Codex can propose steps, but humans should own the release plan.
- Domain-specific quirks: Niche protocols, unusual data pipelines, or strict compliance zones may require custom tooling and extra guardrails.
- Security and privacy: Supplying company code and logs to an AI requires governance, auditing, and data minimization.
Required resources
- Integrated tooling: Access to logs (KQL), code repos, CI/CD, test suites, and vulnerability databases.
- Standards library: Clear, searchable internal rules for coding, auth, logging, and dependencies.
- Observability: Good metrics and traces so incident hypotheses can be tested quickly.
- Human-in-the-loop: Engineers to set goals, approve changes, and refine standards.
When not to use
- One-off scripts or tiny changes where setup overhead outweighs benefits.
- Highly novel research code with no tests, where verification is impossible.
- Ultra-sensitive environments without approved data-sharing pathways.
Open questions
- How to quantify autonomy safely: What's the best mix of auto-fixes vs. human approvals by risk level?
- Continual learning: How should Codex ingest post-incident lessons without forgetting old ones or learning the wrong lesson?
- Economic impact: Beyond MTTR, how do we measure business value (revenue saved, support tickets avoided) consistently?
- Cross-stack robustness: How well does this approach transfer across clouds, languages, and mobile/edge platforms?
06 Conclusion & Future Work
In three sentences: Rakuten embedded Codex, an AI coding agent, into incident response, CI/CD, and development so the same helper can read logs, propose fixes, check security, and even build full features. This cut recovery times roughly in half and turned quarter-length projects into weeks, while keeping safety high by enforcing internal standards automatically. Engineers now spend more time defining what great looks like and verifying outcomes, rather than hand-crafting every step.
Main achievement: Showing that one agent, tuned to company rules and wired into tools like KQL and CI/CD, can make speed, safety, and autonomy rise together, without trading one for another.
Future directions: Broaden tool coverage (more languages/frameworks), richer risk-based approvals, deeper integration with runbooks and post-incident learning, and expanded mobile/edge support. Expect better metrics too, such as reliability per feature shipped and security issues prevented.
Why remember this: It's a blueprint for modern engineering. Give an AI teammate the context, tools, and guardrails, and it turns long-standing bottlenecks (debugging, reviews, vague specs) into quick, verifiable steps that help big teams ship fast and safely.
Practical Applications
- Integrate Codex into your CI/CD to auto-enforce your company's coding and security standards on every pull request.
- Use Codex to draft KQL queries from alerts to speed up root-cause analysis during incidents.
- Adopt plan-execute-verify loops where Codex proposes fixes, runs tests/scans, and iterates before human review.
- Seed Codex with your internal guidelines (auth patterns, logging rules) so suggestions match your architecture.
- Let Codex scaffold new services and mobile apps from partial specs to jump-start greenfield projects.
- Automate dependency and secret scanning, and block merges when critical issues are found.
- Have Codex generate unit/integration tests for each bug fix to prevent regressions.
- Enable rollback/feature-flag playbooks that Codex can trigger or propose during outages.
- Use Codex to produce readable post-incident summaries with the exact queries and evidence used.
- Pilot risk-based approvals: allow Codex auto-fixes under a size/risk threshold while requiring human sign-off for high-risk changes.