The Failure Ledger — 32 Production Lessons
Credibility comes from the failures
The fastest way to spot a fake AI operating-system story is the absence of incidents.
Real systems fail. Tokens expire. APIs paginate silently. Cron runs without environment variables. Agents answer from stale memory. Dashboards show old data. A model migration changes tone. A webhook receiver misses expiring media. A cheap model works in testing and collapses under load.
The reference deployment is credible because it has a failure ledger. Chapter 11b contains the longer production lessons. This chapter is the index: what failed, what fixed it, and the one rule worth carrying forward.
Use it before you build.
The ledger
| # | What failed | Fix | One-rule lesson |
|---|---|---|---|
| 1 | Shared OAuth tokens caused cascading rate-limit failures across agents | One token per agent, capped retries, health checks | Isolate credentials by worker |
| 2 | Cached credentials overrode environment settings | Inspect true auth priority chain and clear hidden caches | Debug the credential actually used |
| 3 | macOS services launched in the wrong mode | Use the right LaunchDaemon/LaunchAgent pattern and explicit HOME | Service managers are part of the app |
| 4 | Invalid runtime config created crash loops | Add config doctor and validation after edits | Validate before restarting the fleet |
| 5 | Unused messaging channels replayed messages and caused storms | Reduce channel surface and disable unused listeners | Fewer channels means fewer failure modes |
| 6 | API header assumptions broke Stockagile-style auth | Test auth with curl before coding wrappers | Vendor auth formats are not universal |
| 7 | Semantic memory filled with duplicates and noise | Deduplicate, curate, and prefer structured docs | Memory without hygiene degrades agents |
| 8 | Policies changed but the brain did not | Add brain updates to product and policy launch checklists | If it is not in the brain, agents do not know it |
| 9 | Autonomy was tempting before quality was proven | Shadow mode, review queues, confidence thresholds | Draft first, approve later, automate last |
| 10 | Reverse-engineered sessions depended on 2FA state | Add session health checks and relogin runbooks | Browser sessions are operational dependencies |
| 11 | A memory product created more noise than recall | Remove harmful components even if they sounded strategic | Kill features that hurt the system |
| 12 | One host could not comfortably run everything | Split primary VPS and local/secondary host roles | Architecture follows operational load |
| 13 | VPN-only MCP access slowed team rollout | Use a public HTTPS tunnel with auth | Access friction kills adoption |
| 14 | Cron jobs failed because env vars were missing | Source config explicitly inside cron scripts (sketch after the table) | Cron is not your shell |
| 15 | Skills lived inside one workspace | Expose skills through MCP/shared brain | Procedures must be shared infrastructure |
| 16 | Plugin config overwrote existing settings | Extend configs instead of replacing them (sketch after the table) | Merge, do not clobber |
| 17 | An important workflow had no official API | Build a small microservice around the actual need | Missing APIs require owned adapters |
| 18 | Interactive permissions hung unattended agents | Configure headless-safe permission modes | Background agents cannot wait for prompts |
| 19 | World-writable plugin paths looked like a quick fix | Fix ownership and permissions properly | Never use chmod 777 as operations |
| 20 | Local loopback services were not reachable safely | Use explicit private-network port forwarding | Secure reachability needs design |
| 21 | Cheap models passed light tests and failed peak load | Route by reliability and workload, not sticker price | Cheapest reliable beats cheapest token |
| 22 | Prompt injection hardening was uneven | Roll out standard anti-injection language fleet-wide | Every agent needs hostile-input assumptions |
| 23 | Humans resent silent agents and resend work | Add an ACK rule for task receipt | Acknowledge fast, execute carefully |
| 24 | Free-tier LLMs were treated like production capacity | Keep free/cheap models as fallback only | Free tiers are not infrastructure |
| 25 | Runtime process behavior did not match systemd Type=simple | Use the correct process manager pattern | Know whether your process forks |
| 26 | Vendor APIs changed silently | Monitor successful semantic calls, not just uptime | HTTP 200 is not proof of correctness |
| 27 | Async Playwright/scraper code collided with MCP event loop | Process-isolate blocking or complex async work | Keep tool servers responsive |
| 28 | Heartbeat schema drift broke runtime validation | Add doctor/fix after runtime upgrades | Schema migration is an operations task |
| 29 | Existing subscriptions were not used in model routing | Reuse reliable paid seats where appropriate | Cost optimization can be architectural |
| 30 | Model migration changed personality and output style | Retune SOUL/prompts per model | Provider swaps are behavior changes |
| 31 | Fleet updates were done too broadly | Stage config fixes and validate agent by agent | Roll out swarm changes in waves |
| 32 | Node 25 broke streaming CLI behavior | Pin affected wrappers to Node 22 LTS | Pin runtimes that affect IO |
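Two of these rows recur often enough to sketch inline. For lesson 14, the fix is a job entrypoint that loads its own environment instead of assuming shell state, because cron starts jobs with a near-empty environment. A minimal sketch in Python; the config path and variable names are hypothetical:

```python
#!/usr/bin/env python3
"""Cron entrypoint that loads its own environment (lesson 14).

A sketch, not the reference implementation: the config path and
variable names are hypothetical. Cron starts jobs with a near-empty
environment, so the script sources config explicitly instead of
assuming shell state.
"""
import os
from pathlib import Path

CONFIG = Path("/opt/agents/config/agent.env")  # hypothetical path


def load_env(path: Path) -> None:
    """Parse simple KEY=VALUE lines and inject them into os.environ."""
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"'))


def main() -> None:
    load_env(CONFIG)
    # Fail loudly at startup if a required variable is still missing.
    for required in ("API_TOKEN", "BRAIN_URL"):  # hypothetical names
        if required not in os.environ:
            raise SystemExit(f"missing env var: {required}")
    # ...run the actual job here...


if __name__ == "__main__":
    main()
```

Invoke the script by absolute path in the crontab entry; nothing in the job should depend on interactive shell state. Lesson 16 is the same discipline applied to configuration writes: extend what exists instead of replacing it. A minimal sketch, assuming JSON settings files at hypothetical paths:

```python
"""Merge plugin config into existing settings (lesson 16).

A minimal sketch assuming JSON config files; both paths are
hypothetical. The point: recurse into nested dicts and extend,
never replace the existing settings wholesale.
"""
import json
from pathlib import Path

def deep_merge(base: dict, incoming: dict) -> dict:
    """Return base extended with incoming; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in incoming.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

settings_path = Path("config/settings.json")  # hypothetical
plugin_path = Path("config/plugin.json")      # hypothetical

existing = json.loads(settings_path.read_text())
plugin = json.loads(plugin_path.read_text())
settings_path.write_text(json.dumps(deep_merge(existing, plugin), indent=2))
```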
How to read this
Do not read the ledger as a list of embarrassing bugs. Read it as a build order.
- Setting up your first company brain: lessons 7, 8, and 15 matter immediately.
- Deploying agents: lessons 1, 4, 9, 18, 21, 22, 23, and 30.
- Exposing tools to a team: lessons 13, 16, 20, 26, and 27.
- Automating scheduled workflows: lessons 14, 25, 28, 31, and 32.
The ledger is a map of where naive implementations break.
Cross-vertical examples
A food brand running invoice intake will hit lessons 14 and 26 when supplier portals export late or cron lacks credentials. A beauty brand using AI for claims review will hit lesson 22 because customer messages can contain hostile instructions. A home brand automating warranty triage will hit lesson 8 if product version docs are stale. A pet brand running subscription workflows will hit lesson 9 if it auto-changes plans before review. An outdoor brand with seasonal spikes will hit lesson 21 if a cheap model collapses during peak warranty volume.
Different verticals, same failure modes.
What a good failure note contains
Each detailed incident file should include:
- Title:
- Date:
- Severity:
- Area:
- What failed:
- User/business impact:
- Root cause:
- Fix:
- Verification:
- Prevention rule:
- Owner:
- Related files/tools:
The prevention rule is the most valuable line. A postmortem that does not change a rule is just storytelling.
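Because the prevention rule is the easiest field to skip under deadline pressure, it helps to lint incident notes mechanically. A minimal sketch, not a standard tool, assuming notes live as markdown files under `lessons/` and use the field labels above:

```python
"""Lint incident notes for required fields (a sketch, not a standard tool).

Assumes incident files live as markdown under lessons/ and use the
field labels from the template above; adjust to your own layout.
"""
from pathlib import Path

REQUIRED_FIELDS = [
    "Title:", "Date:", "Severity:", "Area:", "What failed:",
    "User/business impact:", "Root cause:", "Fix:",
    "Verification:", "Prevention rule:", "Owner:", "Related files/tools:",
]


def lint(path: Path) -> list[str]:
    text = path.read_text()
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in text]
    # The prevention rule is the line that must change future behavior;
    # flag it when the label is present but left blank.
    _, label, rest = text.partition("Prevention rule:")
    first_line = rest.splitlines()[0].strip() if rest.splitlines() else ""
    if label and not first_line:
        problems.append("empty prevention rule")
    return problems


if __name__ == "__main__":
    for note in sorted(Path("lessons").glob("*.md")):  # assumed layout
        for problem in lint(note):
            print(f"{note}: {problem}")
```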
Limitations
The ledger is not complete forever. New failures will appear as the stack changes. It also reflects one reference deployment; your stack will have different vendors and constraints.
Still, the classes of failure travel: auth, memory, process management, stale docs, unsafe autonomy, API drift, permissions, model routing, and human adoption.
How to start this in your business
- Create `lessons/_ledger.md` before you have 32 lessons. Start with the first five things that already went wrong (see the starter example after this list).
- For every incident, write the one-rule lesson within 24 hours while the fix is fresh.
- Link each ledger row to a detailed incident file only when the detail changes future behavior.
- Review the ledger before every new automation project.
- Fork `lessons/_ledger.md` as the artifact and use Chapter 11b as the detailed reference model.
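A starter `lessons/_ledger.md` needs no tooling; the same table format used above is enough. A hypothetical first row, borrowed from lesson 14:

```markdown
| # | What failed | Fix | One-rule lesson |
|---|---|---|---|
| 1 | Cron job ran without env vars | Source config inside the script | Cron is not your shell |
```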
Fork the repo, read the playbook, and adapt the artifacts to your own stack. For hands-on help, email hello@usecompai.com.