The Failure Ledger — 32 Production Lessons
Credibility comes from the failures
The fastest way to spot a fake AI operating-system story is the absence of incidents.
Real systems fail. Tokens expire. APIs paginate silently. Cron runs without environment variables. Agents answer from stale memory. Dashboards show old data. A model migration changes tone. A webhook receiver misses expiring media. A cheap model works in testing and collapses under load.
The reference deployment is credible because it has a failure ledger. Chapter 11b contains the longer production lessons. This chapter is the index: what failed, what fixed it, and the one rule worth carrying forward.
Use it before you build.
The ledger
| # | What failed | Fix | One-rule lesson |
|---|---|---|---|
| 1 | Shared OAuth tokens caused cascading rate-limit failures across agents | One token per agent, capped retries, health checks | Isolate credentials by worker |
| 2 | Cached credentials overrode environment settings | Inspect true auth priority chain and clear hidden caches | Debug the credential actually used |
| 3 | macOS services launched in the wrong mode | Use the right LaunchDaemon/LaunchAgent pattern and explicit HOME | Service managers are part of the app |
| 4 | Invalid runtime config created crash loops | Add config doctor and validation after edits | Validate before restarting the fleet |
| 5 | Unused messaging channels replayed messages and caused storms | Reduce channel surface and disable unused listeners | Fewer channels means fewer failure modes |
| 6 | API header assumptions broke Stockagile-style auth | Test auth with curl before coding wrappers | Vendor auth formats are not universal |
| 7 | Semantic memory filled with duplicates and noise | Deduplicate, curate, and prefer structured docs | Memory without hygiene degrades agents |
| 8 | Policies changed but the brain did not | Add brain updates to product and policy launch checklists | If it is not in the brain, agents do not know it |
| 9 | Autonomy was tempting before quality was proven | Shadow mode, review queues, confidence thresholds | Draft first, approve later, automate last |
| 10 | Reverse-engineered sessions depended on 2FA state | Add session health checks and relogin runbooks | Browser sessions are operational dependencies |
| 11 | A memory product created more noise than recall | Remove harmful components even if they sounded strategic | Kill features that hurt the system |
| 12 | One host could not comfortably run everything | Split primary VPS and local/secondary host roles | Architecture follows operational load |
| 13 | VPN-only MCP access slowed team rollout | Use a public HTTPS tunnel with auth | Access friction kills adoption |
| 14 | Cron jobs failed because env vars were missing | Source config explicitly inside cron scripts (sketch after the table) | Cron is not your shell |
| 15 | Skills lived inside one workspace | Expose skills through MCP/shared brain | Procedures must be shared infrastructure |
| 16 | Plugin config overwrote existing settings | Extend configs instead of replacing them (sketch after the table) | Merge, do not clobber |
| 17 | An important workflow had no official API | Build a small microservice around the actual need | Missing APIs require owned adapters |
| 18 | Interactive permissions hung unattended agents | Configure headless-safe permission modes | Background agents cannot wait for prompts |
| 19 | World-writable plugin paths looked like a quick fix | Fix ownership and permissions properly | Never use chmod 777 as operations |
| 20 | Local loopback services were not reachable safely | Use explicit private-network port forwarding | Secure reachability needs design |
| 21 | Cheap models passed light tests and failed peak load | Route by reliability and workload, not sticker price | Cheapest reliable beats cheapest token |
| 22 | Prompt injection hardening was uneven | Roll out standard anti-injection language fleet-wide | Every agent needs hostile-input assumptions |
| 23 | Humans resent silent agents and resend work | Add an ACK rule for task receipt | Acknowledge fast, execute carefully |
| 24 | Free-tier LLMs were treated like production capacity | Keep free/cheap models as fallback only | Free tiers are not infrastructure |
| 25 | Runtime process behavior did not match systemd Type=simple | Use the correct process manager pattern | Know whether your process forks |
| 26 | Vendor APIs changed silently | Monitor successful semantic calls, not just uptime | HTTP 200 is not proof of correctness |
| 27 | Async Playwright/scraper code collided with MCP event loop | Process-isolate blocking or complex async work | Keep tool servers responsive |
| 28 | Heartbeat schema drift broke runtime validation | Add doctor/fix after runtime upgrades | Schema migration is an operations task |
| 29 | Existing subscriptions were not used in model routing | Reuse reliable paid seats where appropriate | Cost optimization can be architectural |
| 30 | Model migration changed personality and output style | Retune SOUL/prompts per model | Provider swaps are behavior changes |
| 31 | Fleet updates were done too broadly | Stage config fixes and validate agent by agent | Roll out swarm changes in waves |
| 32 | Node 25 broke streaming CLI behavior | Pin affected wrappers to Node 22 LTS | Pin runtimes that affect IO |
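Two of these rows recur often enough to sketch inline. For lesson 14, the fix is a job entrypoint that loads its own environment instead of assuming shell state, because cron starts jobs with a near-empty environment. A minimal sketch in Python; the config path and variable names are hypothetical:

```python
#!/usr/bin/env python3
"""Cron entrypoint that loads its own environment (lesson 14).

A sketch, not the reference implementation: the config path and
variable names are hypothetical. Cron starts jobs with a near-empty
environment, so the script sources config explicitly instead of
assuming shell state.
"""
import os
from pathlib import Path

CONFIG = Path("/opt/agents/config/agent.env")  # hypothetical path


def load_env(path: Path) -> None:
    """Parse simple KEY=VALUE lines and inject them into os.environ."""
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"'))


def main() -> None:
    load_env(CONFIG)
    # Fail loudly at startup if a required variable is still missing.
    for required in ("API_TOKEN", "BRAIN_URL"):  # hypothetical names
        if required not in os.environ:
            raise SystemExit(f"missing env var: {required}")
    # ...run the actual job here...


if __name__ == "__main__":
    main()
```

Invoke the script by absolute path in the crontab entry; nothing in the job should depend on interactive shell state. Lesson 16 is the same discipline applied to configuration writes: extend what exists instead of replacing it. A minimal sketch, assuming JSON settings files at hypothetical paths:

```python
"""Merge plugin config into existing settings (lesson 16).

A minimal sketch assuming JSON config files; both paths are
hypothetical. The point: recurse into nested dicts and extend,
never replace the existing settings wholesale.
"""
import json
from pathlib import Path

def deep_merge(base: dict, incoming: dict) -> dict:
    """Return base extended with incoming; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in incoming.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

settings_path = Path("config/settings.json")  # hypothetical
plugin_path = Path("config/plugin.json")      # hypothetical

existing = json.loads(settings_path.read_text())
plugin = json.loads(plugin_path.read_text())
settings_path.write_text(json.dumps(deep_merge(existing, plugin), indent=2))
```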
How to read this
Do not read the ledger as a list of embarrassing bugs. Read it as a build order.
- Setting up your first company brain: lessons 7, 8, and 15 matter immediately.
- Deploying agents: lessons 1, 4, 9, 18, 21, 22, 23, and 30.
- Exposing tools to a team: lessons 13, 16, 20, 26, and 27.
- Automating scheduled workflows: lessons 14, 25, 28, 31, and 32.
The ledger is a map of where naive implementations break.
Cross-vertical examples
A food brand running invoice intake will hit lessons 14 and 26 when supplier portals export late or cron lacks credentials. A beauty brand using AI for claims review will hit lesson 22 because customer messages can contain hostile instructions. A home brand automating warranty triage will hit lesson 8 if product version docs are stale. A pet brand running subscription workflows will hit lesson 9 if it auto-changes plans before review. An outdoor brand with seasonal spikes will hit lesson 21 if a cheap model collapses during peak warranty volume.
Different verticals, same failure modes.
What a good failure note contains
Each detailed incident file should include:
- Title:
- Date:
- Severity:
- Area:
- What failed:
- User/business impact:
- Root cause:
- Fix:
- Verification:
- Prevention rule:
- Owner:
- Related files/tools:
The prevention rule is the most valuable line. A postmortem that does not change a rule is just storytelling.
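Because the prevention rule is the easiest field to skip under deadline pressure, it helps to lint incident notes mechanically. A minimal sketch, not a standard tool, assuming notes live as markdown files under `lessons/` and use the field labels above:

```python
"""Lint incident notes for required fields (a sketch, not a standard tool).

Assumes incident files live as markdown under lessons/ and use the
field labels from the template above; adjust to your own layout.
"""
from pathlib import Path

REQUIRED_FIELDS = [
    "Title:", "Date:", "Severity:", "Area:", "What failed:",
    "User/business impact:", "Root cause:", "Fix:",
    "Verification:", "Prevention rule:", "Owner:", "Related files/tools:",
]


def lint(path: Path) -> list[str]:
    text = path.read_text()
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in text]
    # The prevention rule is the line that must change future behavior;
    # flag it when the label is present but left blank.
    _, label, rest = text.partition("Prevention rule:")
    first_line = rest.splitlines()[0].strip() if rest.splitlines() else ""
    if label and not first_line:
        problems.append("empty prevention rule")
    return problems


if __name__ == "__main__":
    for note in sorted(Path("lessons").glob("*.md")):  # assumed layout
        for problem in lint(note):
            print(f"{note}: {problem}")
```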
Limitations
The ledger is not complete forever. New failures will appear as the stack changes. It also reflects one reference deployment; your stack will have different vendors and constraints.
Still, the classes of failure travel: auth, memory, process management, stale docs, unsafe autonomy, API drift, permissions, model routing, and human adoption.
How to start this in your business
- Create `lessons/_ledger.md` before you have 32 lessons. Start with the first five things that already went wrong (see the starter example after this list).
- For every incident, write the one-rule lesson within 24 hours while the fix is fresh.
- Link each ledger row to a detailed incident file only when the detail changes future behavior.
- Review the ledger before every new automation project.
- Fork `lessons/_ledger.md` as the artifact and use Chapter 11b as the detailed reference model.
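A starter `lessons/_ledger.md` needs no tooling; the same table format used above is enough. A hypothetical first row, borrowed from lesson 14:

```markdown
| # | What failed | Fix | One-rule lesson |
|---|---|---|---|
| 1 | Cron job ran without env vars | Source config inside the script | Cron is not your shell |
```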
Fork the repo, read the playbook, and adapt the artifacts to your own stack. For hands-on help, email hello@usecompai.com.