The "Zero Incidents" Trap: Why Chasing Perfection Backfires
I'll confess: for years, I was an adamant believer.
I was convinced that achieving "zero incidents" was just a matter of more thorough testing and more rigorous code reviews.
I chased that ghost for years, and I never once caught it.
That all changed when I learned Erlang.
It took me a while to truly grasp the "let it crash" philosophy, but when it finally clicked, it changed my perspective forever.
I realised that since failure is inevitable, the real goal isn't to defend against every possible fault; it's to build systems that can detect failures early and recover from them rapidly.
It’s about building for resilience, not for an imaginary state of perfection.
*Death of Icarus* by Alexandre Cabanel
TL;DR
The "zero incidents" goal is a well-intentioned but flawed aspiration that often backfires. It eventually leads to a culture of fear that discourages learning from small failures, which paradoxically increases risk. This makes systems more fragile and burns out your best engineers.
The better path is Site Reliability Engineering (SRE). Instead of chasing an impossible target, SRE uses data-driven tools like Service Level Objectives (SLOs) and error budgets to manage risk pragmatically. It's how we can innovate quickly while keeping our services stable and our users happy.
| Aspect | "Zero Incidents" Approach | SRE Approach |
|---|---|---|
| Primary Goal | Chasing perfection (100% uptime) | Balancing reliability & innovation ⚖️ |
| View of Risk | A threat to be eliminated at all costs | An economic trade-off to be optimised 📈 |
| View of Failure | An unpleasant disruption | An opportunity to learn something 💡 |
| Key Mechanism | Top-down mandates | Data-driven error budgets 📊 |
| Cultural Impact | Psychological unsafety, blame, and reluctance to surface problems | Psychological safety & shared ownership 🤝 |
1. The "Zero Incidents" Trap
1.1. The Allure of a Flawless Record
We've all heard mandates and aspirations similar to "Our new goal is zero incidents!"
It’s a simple, powerful slogan that signals a deep commitment to quality and predictability. The philosophy is straightforward: bug fixes and stability work must take absolute priority over developing new features.
But here’s the problem: in any complex software system, a true state of zero incidents is a mathematical impossibility. It’s an attempt to apply a factory-floor mindset to the inherently messy and unpredictable world of software development. The modern focus, even in industrial safety, has shifted away from preventing every error to building systems with the resilience to "fail safely". That's a goal worth chasing.
1.2. The Hidden Damage of Chasing Perfection
When you treat "zero incidents" as a literal goal, you don't foster excellence; you pave the way for a culture of psychological unsafety. The intense pressure leads to employee burnout and "safety fatigue", and people quietly stop reporting problems.
This silence is toxic. It drives the most valuable learning opportunities underground: the minor issues and the near-misses. We lose the chance to learn from small problems before they cascade into catastrophic ones, creating a "cultural debt" that makes the entire organisation more fragile.
1.3. A Smarter Alternative: The "Zero Known Bugs" Policy
So, what's the pragmatic alternative? Mature organisations reinterpret this idea not as a literal goal, but as a clear policy for handling known defects.
The principle is simple: Known bugs that cause significant harm must be fixed with urgency before new feature development proceeds. This requires a rigorous triage process to classify issues based on actual impact, not just a raw bug count:
- Critical Issue: The system is actively harmed (e.g., "the shop is on fire" 🔥). All other work stops to fix it.
- Improvement: A defect that is not critical. It's reclassified and prioritised against other feature work.
- New Feature: Missing functionality that was not in the original scope. It's added to the product roadmap.
This simple reframing forces a rational, risk-based discussion about priorities instead of an emotional reaction to an arbitrary bug count.
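To make that triage rule concrete, here's a minimal sketch in Python. The field names (`causes_active_harm`, `is_defect`) are purely illustrative assumptions, not a prescribed schema; the point is that classification hinges on impact, not on counting bugs.

```python
from enum import Enum

class Triage(Enum):
    CRITICAL_ISSUE = "critical"      # "the shop is on fire": all other work stops
    IMPROVEMENT = "improvement"      # non-critical defect: prioritised against feature work
    NEW_FEATURE = "new_feature"      # missing scope: goes on the product roadmap

def classify(report: dict) -> Triage:
    """Classify a report by actual impact, not by adding to a raw bug count."""
    if report.get("causes_active_harm"):   # hypothetical field: is the system actively harmed?
        return Triage.CRITICAL_ISSUE
    if report.get("is_defect"):            # hypothetical field: broken behaviour vs missing scope
        return Triage.IMPROVEMENT
    return Triage.NEW_FEATURE
```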
2. A Superior Way: Thinking in Reliability with SRE
This brings us to Site Reliability Engineering (SRE).
SRE completely flips the script. Instead of aiming for the impossible goal of preventing all failure, its core philosophy is to manage risk in a smart, data-driven way. SRE is founded on the principle of "embracing risk", acknowledging that 100% reliability is not only impossible but also an economically irrational target for most services.
The real goal is to find that sweet spot—the optimal level of reliability where users are happy, but we haven't stifled innovation or spent a fortune chasing perfection.
2.1. Finally, a Way to "Fail Fast" Safely
You’ve probably heard the mantra "fail fast, fail often, fail early". It's the foundation of modern agile development. A literal "zero incidents" policy is its mortal enemy, as it creates a fear of failure that grinds innovation to a halt.
SRE, on the other hand, provides the perfect framework to make "fail fast" a reality in a disciplined way:
- Error Budgets Give Permission to Fail: This is the game-changer. An error budget is a data-driven allowance for how much "unreliability" is acceptable for a service. As long as you’re within budget, you have a green light to experiment and release new features. It's explicit, data-backed permission to take calculated risks (a minimal calculation sketch follows this list).
- Blameless Culture Creates Safety: SRE treats every incident as a priceless learning opportunity, not a reason to point fingers. Blameless postmortems create the psychological safety needed for teams to be brutally honest about what really happened, allowing us to fix systemic issues instead of stopping at "human error".
- Chaos Engineering is Failing by Design: This is the ultimate expression of the philosophy. We intentionally and carefully inject failures into our systems to proactively find weaknesses before they cause a real outage. It's about building resilience by training for failure.
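To make the error-budget mechanism tangible, here's a minimal sketch for a simple request-based SLO. The numbers, function name, and shipping rule are illustrative assumptions, not a standard SRE API:

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent for a request-based SLO.

    slo_target: e.g. 0.9995 means 99.95% of requests must succeed.
    """
    allowed_failures = (1.0 - slo_target) * total_requests   # the error budget in absolute terms
    if allowed_failures == 0:
        return 0.0                                           # a 100% target leaves no budget at all
    return 1.0 - (failed_requests / allowed_failures)

# Example: a 99.95% SLO over 10 million requests allows 5,000 failures.
remaining = error_budget_remaining(0.9995, 10_000_000, 3_200)
print(f"{remaining:.0%} of the budget left")   # 36% of the budget left
can_take_calculated_risks = remaining > 0      # within budget: green light to experiment
```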
3. The Fintech Imperative: Why SRE is Non-Negotiable
Now, let's talk about the big leagues. For those of us working in financial services, this isn't just a philosophical debate. It's a matter of survival.
Fintech operates at a brutal intersection of rapid innovation, intense regulatory scrutiny, and a non-negotiable demand for reliability. Here, failures aren't just bugs; they can cause immediate, irreversible financial loss and shatter the trust that underpins the entire system.
3.1. SRE Speaks the Language of Regulators
Here’s the REAL win: Global regulators are shifting their focus. They're moving away from asking "how do you prevent all incidents?" to "how resilient are you when an incident occurs?". This is a massive shift, and it aligns perfectly with SRE.
3.2. Defining What Matters: From Journeys to SLOs
So, where do you start? You map your most Critical User Journeys (CUJs) and define user-centric SLOs for the services that support them. It’s all about measuring what your users actually care about.
Let's make this concrete with a few examples:
- Journey: A customer makes a payment.
- SLO: 99.95% success rate for the Payment API.
- Journey: A trader executes a stock trade.
- SLO: 99% of orders confirmed in under 100ms.
- Journey: A new user signs up.
- SLO: 95% of users complete the onboarding flow in under 10 minutes.
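If you prefer to pin these objectives down in code rather than in a wiki page, here's a minimal sketch. The dataclass shape, journey names, and SLI names are assumptions for illustration, not a standard format:

```python
from dataclasses import dataclass

@dataclass
class SLO:
    journey: str    # the Critical User Journey this SLO protects
    sli: str        # the indicator we measure
    target: float   # the objective, expressed as a fraction

SLOS = [
    SLO("Customer makes a payment", "payment_api_success_rate", 0.9995),
    SLO("Trader executes a stock trade", "orders_confirmed_under_100ms", 0.99),
    SLO("New user signs up", "onboarding_completed_under_10min", 0.95),
]

def is_meeting(slo: SLO, measured_sli: float) -> bool:
    """Compare the measured SLI against its objective."""
    return measured_sli >= slo.target
```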
This isn't just about uptime anymore. It's about defining success from your customer's perspective and using that data to drive every decision.
4. Engineering for Real-World Resilience
Alright, theory is great, but how do we actually build systems that can take a punch? It comes down to a few core engineering disciplines.
4.1. Build Resilient Architecture
You can't bolt on resilience at the end; it has to be in the DNA of your system. This goes beyond just favouring modular architectures over monoliths. It also means embracing advanced deployment strategies that assume failure is not just possible, but quite likely. The goal is to mitigate impact and detect problems early.
- Progressive Rollouts: Using techniques like canary releases and blue/green deployments to expose new code to a small subset of users before a full rollout.
- Decoupling Deployment from Release: Using feature flags to turn features on or off in production without a new deployment, giving you an instant kill switch.
- Designing for Graceful Degradation: Building systems that can operate in a partially degraded state rather than failing completely. (Heads up: this is quite an advanced topic and deserves its own post. Maybe I'll get to it in the near future!)
- Smart Retries: Implementing client-side logic like retries with exponential backoff and jitter to handle transient network or service failures gracefully (a minimal sketch follows this list).
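As a flavour of that last point, here's a minimal retry helper with exponential backoff and full jitter. The timings and the choice to retry only `ConnectionError` are simplifying assumptions to keep it short:

```python
import random
import time

def call_with_retries(operation, max_attempts: int = 5, base_delay: float = 0.1, max_delay: float = 5.0):
    """Retry a flaky operation with exponential backoff and full jitter.

    Jitter spreads retries out so that many clients don't hammer a
    recovering service at the same instant (the "thundering herd").
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:                      # only retry failures we believe are transient
            if attempt == max_attempts:
                raise                                # retries exhausted: surface the error
            backoff = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, backoff))   # full jitter: sleep a random slice of the backoff
```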
All this is powered by robust CI/CD pipelines. The goal remains the same: make change small, frequent, and low-risk.
Remember: automation beats heroics. Every. Single. Time.
4.2. Supercharge Your Testing
If you're not testing for failure, you're not testing for reality.
- Contract Testing: Verify the "contracts" between your services to prevent integration failures before they happen. It's a powerful way to eliminate entire classes of production incidents (see the sketch after this list).
- Chaos Engineering: Proactively inject failures into your systems. Break things on purpose, in a controlled way, to find weaknesses before your customers do. It's the ultimate stress test.
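To give a flavour of contract testing without committing to a specific tool (Pact is a popular choice), here's a minimal consumer-side sketch. The contract fields and the example response are illustrative assumptions:

```python
# A consumer-defined "contract": the fields and types this consumer relies on.
PAYMENT_CONTRACT = {
    "payment_id": str,
    "status": str,
    "amount_minor_units": int,
    "currency": str,
}

def contract_violations(response: dict, contract: dict) -> list[str]:
    """Return a list of violations; an empty list means the provider honours the contract."""
    violations = []
    for field, expected_type in contract.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(f"wrong type for {field}: got {type(response[field]).__name__}")
    return violations

# Run in CI against the provider's verified example responses, not production:
# the goal is to catch breaking changes before they ship.
assert contract_violations(
    {"payment_id": "p-123", "status": "settled", "amount_minor_units": 1999, "currency": "GBP"},
    PAYMENT_CONTRACT,
) == []
```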
4.3. Eliminate Toil: The Silent Killer
Let's talk about the silent killer of engineering productivity: toil. As defined in the SRE book, toil is any operational work that is:
- Manual
- Repetitive
- Automatable
- Tactical (not strategic)
- Provides no enduring value
Toil is the interest payment on your technical debt. It's the perfect example of Eisenhower's "urgent but not important" work: the tasks that scream for your attention but don't move the needle on your strategic goals. The primary goal is to relentlessly automate toil, freeing up engineers to work on what is truly important: building better, more resilient systems.
5. Reporting Reliability the Right Way
So how do you prove any of this is working to leadership?
Stop reporting on the past. The conversation must shift from lagging indicators (like the number of incidents last quarter) to forward-looking, strategic metrics.
A truly valuable reliability report includes:
- SLO Performance Dashboards: An at-a-glance, real-time view of the health of critical business services.
- Error Budget Trends: This is your crystal ball. It's not about what broke, but what might break. It's a forward-looking indicator of accumulating risk (a tiny burn-rate sketch follows this list).
- Connection to Business KPIs: Speak their language. Show the direct correlation between a healthy error budget and lower customer churn, or between SLO performance and higher feature velocity.
- A Summary of Learnings: Frame incidents not as failures, but as investments in resilience. What did we learn, and how are we stronger because of it?
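One way to make that "crystal ball" concrete is a burn-rate calculation: how fast the current error rate would spend the budget if it continued. A minimal sketch, with illustrative numbers:

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' we are spending the error budget.

    A burn rate of 1.0 spends exactly the budget over the SLO window; anything
    much higher is a forward-looking warning, often before users even notice.
    Assumes a target below 100% (a 100% SLO has no budget to burn).
    """
    budget = 1.0 - slo_target          # e.g. 0.0005 for a 99.95% SLO
    return observed_error_rate / budget

# Example: a 0.2% error rate against a 99.95% SLO burns the budget 4x too fast.
print(burn_rate(0.002, 0.9995))   # 4.0
```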
This changes the conversation from "Why did things break?" to a much more powerful question: "How are we managing our risk to innovate faster and more safely?"
Final Words
Ultimately, this shift isn't just about changing metrics; it's about upgrading your entire engineering culture by choosing to trade fear for data, and blame for learning. The real question isn't whether you'll have incidents, but what you'll do with them.
So, I'll leave you with this:
Are you building systems to avoid failure, or are you building them to learn from it?