Reclaiming Value from Bugs and Outages: Thoughts on Post-Mortems

(Thanks to Chris Frank and Sam Decesare for the feedback!)

Bugs and outages happen. If the team can learn from them though, it's possible to reclaim some of that lost effectiveness, sometimes even to the point that the learning is more valuable than the effectiveness lost.

The best mechanism I've seen to drive that learning is for the team to meet after the outage or defect is no longer an immediate problem and try to learn from what happened. This meeting is often called the Blameless Post-mortem. These meetings have been the biggest agents of improvement that I've ever seen on an engineering team, so if you think you'll save time by skipping them, you're probably making a huge mistake.

Institute Blamelessness

It's crucial to assume that everyone involved in the issue would've avoided it if possible, given the chance again. It's natural for people to want to point the finger (sometimes even at themselves) but if the team allows this, it'll quickly lead to a culture of cover-ups where nothing can be learned because real information about what happened can't be extracted.

What's worse than this is that when you decide who's responsible, you have no reason to continue to investigate what there is to learn. It seems obvious that since that person is to blame, if they're either removed or coerced to try harder next time, the issue won't reoccur.

This reminds me of reading about airplane crashes in the news when the airline concludes the cause was "human error". That was really the best that airline could do? They decided to either ask that human to be better next time, or replace that human with another human. There's really nothing to learn? No way to improve the aircraft? No way to improve the process? No way to improve the environment? No way to improve the human's understanding of these? This is an airline I'd be afraid to fly with.

Sidney Dekker's The Field Guide to Understanding Human Error, an absolutely genius book on the subject, sums it up nicely: "Human error is not the conclusion of an investigation. It is the starting point."

Look for all the Causes

Often these meetings are called RCAs or Root-Cause Analysis meetings, but I try not to call them that anymore, because there never really ever seems to be a single root cause. Computer systems are complex systems, and so usually multiple things have gone wrong in order for a defect to appear. John Allspaw explains it more succinctly:

Generally speaking, linear chain-of-events approaches are akin to viewing the past as a line-up of dominoes, and reality with complex systems simply don’t work like that. Looking at an accident this way ignores surrounding circumstances in favor of a cherry-picked list of events, it validates hindsight and outcome bias, and focuses too much on components and not enough on the interconnectedness of components.

Also make sure you've validated your assumptions about these causes. There's a powerful tendency to fall prey to the What-You-Look-For-Is-What-You-Find principle for the sake of expediency and simplicity.

Dig deeper

It's usually a good idea to dig deeper than you'd expect when determining how a problem occurred. One practice that forces a deeper investigation is The 5 Whys. Basically you just ask "why?" for each successive answer until you've gone at least 5 levels deep, attempting to find a precursor for any problem and its precursors. Often in the deeper parts of this conversation, you end up investigating bigger picture problems, like the company's values, external pressures, and long-standing misconceptions. These deeper problems often require internal or external social conflicts to be resolved, which often makes them tough, but also high-value.

A few caveats though:

It can quickly become "The 5 Whos". You still want to remain blameless.
It assumes that there's a single chain of cause and effect leading to the defect (and there rarely is).

This second reason is the reason I don't really care for The 5 Whys practice anymore. John Allspaw's got some great further discussion about that problem here as well.

Decide on Next Actions as a Team

One common yet major mistake I've seen is that a manager hears the description of the defect/outage and says "well we'll just do x then" without soliciting the input of the team. This is usually a huge mistake because the manager usually doesn't have the intimate knowledge of the system that the developers do, and even if he/she did, it is extremely unlikely that any single person can consistently decide a better course of action than a team. Also, the more a manager does this, the less likely the team is to feel that it's their place to try to solve these problems.

With that said, your team should still be conscious that some of the causal factors are not technical at all. Some examples that I've personally seen are:

The team is over-worked.
The team has been under too much pressure to move quickly or to meet unrealistic deadlines.
The team wants more training in certain areas.
Management is constantly changing priorities and forcing work to be left partially finished.
The team's workflow is too complicated and merges keep leading to undetected defects.
Management won't allow time to be spent on quality measures.

It's a common mistake for developers to focus only on the technical problems, probably because they're the most easily controlled by the development team, but I would say for a team to be truly effective, it must be able to address the non-technical factors as well, and often manage up. Great management will pay very close attention to the team's conclusions.

Resist "Try Harder Next Time"

Hand-in-hand with blamelessness should almost always be a rule that no improvement should involve "Trying harder next time". That would be assuming someone didn't try hard enough the last time, and it's assuming that only effort needs to change in order for the team to be more effective next time. People will either naturally want to try harder next time, or they won't. Saying "try harder next time" usually won't change a thing.

In fact you'd usually be more successful not just by avoiding solutions that require more human discipline, but by taking that one step further and reducing the level of discipline already required. There's a great blog post on this by Marco Arment here, and I can tell you, the results in real life are often amazing.

Humans are great at conserving their energy for important things, or things that are likely to cause issues, but the result of that is that unlikely events are often not given much effort at all. This is a common trade-off in all of nature that Erik Hollnagel calls the ETTO (Efficiency Thoroughness Trade-Off) principle. You don't want your solutions to be fighting an uphill battle against nature.

There's another kind of strange result of this (I think, anyway) called "the bystander effect", where often if a problem is the responsibility of multiple people, it's less likely that any single person will take responsibility for it. This is a real phenomenon and if you've worked on a team for any length of time, you've seen it happen. You'll want to try to make sure that whatever solutions you come up with, they won't fall victim to this bystander effect.

Consider Cost-to-fix vs. Cost-to-endure

It should go without saying that the cost of your solutions for issues should be less than the cost of the issue itself. Sometimes these costs are really hard to estimate given the number of variables and unknowns involved, but it's at least worth consideration. An early-stage start-up is unlikely to care about being multi-data-center for the sake of redundancy, for instance. It would be ridiculous, on the other hand, for a bank to not seek this level of redundancy.
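As a rough illustration of that comparison, here is a minimal sketch. The function names and every number in it are hypothetical, and real estimates will be far fuzzier than this, but even a back-of-the-envelope version can make the trade-off visible:

```python
def expected_cost_to_endure(incidents_per_year, cost_per_incident, years=1.0):
    """Expected cost of living with the issue over the given time horizon."""
    return incidents_per_year * cost_per_incident * years


def worth_fixing(cost_to_fix, incidents_per_year, cost_per_incident, years=1.0):
    """True when the fix is cheaper than continuing to endure the issue."""
    return cost_to_fix < expected_cost_to_endure(
        incidents_per_year, cost_per_incident, years
    )


# Hypothetical numbers, for illustration only: the same redundancy project
# and outage frequency, but a very different cost per outage.
startup = worth_fixing(cost_to_fix=200_000, incidents_per_year=0.5,
                       cost_per_incident=10_000)
bank = worth_fixing(cost_to_fix=200_000, incidents_per_year=0.5,
                    cost_per_incident=5_000_000)
print(startup, bank)
```

The same fix comes out as not worth it for the start-up and clearly worth it for the bank, purely because of what an outage costs each of them.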

Consider Process Costs

The second most naive reaction to a bug or outage (after "Try harder next time") is usually to add some more human process to the team's existing process, like more checks or more oversight. While you may well conclude that this is the best measure for a given issue, keep in mind that it's probably also the slowest, most expensive and most error-prone. These sorts of solutions are often the simplest to conceive, but if they're your team's only reaction to issues, they will build up more and more over time, dooming the team to move really slowly while it works its way repeatedly through these processes.

Improve Defect Prevention

There are a tonne of possible ways to try to prevent bugs from reaching production that are too numerous to get into here, but there are two really important ways to evaluate them:

(1) Does the method find bugs quickly and early in development?

The shortness of your feedback loop in detecting bugs is hugely important in making sure that your prevention methods don't slow the team down so much that their cost outweighs their benefit. Manual human testing by a separate testing team is probably the slowest and latest way to find bugs, whereas syntax highlighters may be both the fastest and earliest for the class of issue that they can uncover (of course these methods each test completely different things, but they're mentioned to give an idea of both extremes of feedback loops).

(2) Does the method give you enough information to fix the problem quickly/easily?

This criterion is important, though admittedly probably less so than the previous one. You will want to judge your prevention measures on it though, because it's another factor that can cost you a lot of time/efficiency. Incidentally, manual human testing is probably the worst by this criterion as well, because testing at that level generally just lets you know that something is broken in a particular area. Unit-testing beats integration-testing in this particular area as well, because unit-testing does a better job of helping you pin-point the issue down to a particular unit (though they don't actually test for the same types of bugs at all, so it's a bit of an unfair comparison).

With these two criteria in mind, it's useful to look at a number of defect prevention measures critically: TDD, unit testing, integration testing, manual testing, beta testing, fuzz-testing, mutation testing, staging environments, dog-fooding, automated screen-shot diffing, static analysis, static typing, linting, pair programming, code reviews, formal proofs, 3rd-party auditing, checklists, etc. I've tried to be a bit exhaustive in that list, and while I've added some options that have never been useful to me, I've probably also forgotten a few. There are new preventative measures popping up all the time too.

An amazing example of preventative measures is probably the extremely popular SQLite project. Their discussion of the measures that they take is fascinating.

Remove Human Processes with Automation

I've hinted at this a few times so far, but it should be reiterated that automating otherwise manual human processes can bring a level of speed and consistency to them that humans can't compete with. Oftentimes this is tricky and not worth it, but these scenarios are getting fewer as technology progresses. There are two huge risks in automation though:

(1) Automation also involves software that can fail. Now you have two systems to maintain and try to keep defect-free.

Often this second system (the automation of the primary system) doesn't get engineered with the same rigor as the primary system, so it's possible to automate in a way that is less consistent and more error-prone than a human.

(2) If you're really going to replace a manual human task, make sure the automation really does outperform the human.

I've seen many attempts at automation not actually meet the goal of doing the job as well as a human. It's not uncommon at all to see examples like teams with 90+% automated test coverage releasing drop-everything defects as awesome as "the customer can't log in" because some CSS issue makes the login form hidden. A team that sees bugs like this often is almost certainly not ready to remove humans from the testing process, regardless of how many tests they've written.

Eliminate Classes of Bugs

When you think about preventing defects without succumbing to "Try Harder Next Time" thought-patterns, one of the most powerful tools is to try to consider how you could make that defect impossible in the future. Often it's possible to avoid defects by working at levels of better abstraction. Here are a few examples:

Avoid off-by-one errors in for loops by using iterator functions, or functional paradigms that give you filter(), map() and reduce().
Avoid SQL injection by using prepared statements.
Avoid Cross-site scripting attacks by using an html template language that automatically escapes output.
Avoid bugs from unexpected changes in shared objects by using immutable data structures.
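To make the prepared-statement item concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table and the `find_user` helper are hypothetical; the point is only that the `?` placeholder hands the input to the database as data, so it can never be interpreted as SQL:

```python
import sqlite3


def find_user(conn, name):
    # The "?" placeholder binds `name` as a value; it is never spliced
    # into the SQL text, so quotes inside it have no special meaning.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchall()


# Hypothetical in-memory table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# A classic injection attempt is treated as an (unmatched) literal name.
hostile = "alice' OR '1'='1"
print(find_user(conn, "alice"))
print(find_user(conn, hostile))
```

If the query were built with string concatenation instead, the hostile input would match every row. With the placeholder there is nothing for a developer to remember to escape, which is what makes the whole class of bug go away.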

You may need some creativity here for your own particular defects, but in many cases eliminating the opportunity for a bug to arise is better than trying to catch it when it happens.

For example, I once worked on a team that would occasionally forget to remove console.log() calls from our client-side javascript, which would break the entire site in IE 8. By putting a check for console.log() calls in the build (and breaking it when they exist), we eliminated this class of defect entirely.
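A build check like that can be very small. The following is a sketch of the idea rather than the script that team actually used; the function names are made up, and a plain regular expression will also flag calls inside comments or strings, which may or may not be what you want:

```python
import re
import sys
from pathlib import Path

# Matches "console.log(" but not e.g. "myconsole.log(".
CONSOLE_LOG = re.compile(r"\bconsole\.log\s*\(")


def find_console_logs(source):
    """Return the 1-based line numbers that contain a console.log( call."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if CONSOLE_LOG.search(line)
    ]


def check_tree(root):
    """Scan every .js file under `root` and describe each offending line."""
    problems = []
    for path in sorted(Path(root).rglob("*.js")):
        for number in find_console_logs(path.read_text(encoding="utf-8")):
            problems.append(f"{path}:{number}: console.log() call found")
    return problems


def main(argv):
    """Return a non-zero exit code when any call is found, failing the build."""
    problems = check_tree(argv[1] if len(argv) > 1 else ".")
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

A build step would then end with something like `sys.exit(main(sys.argv))`, so that a stray call stops the build instead of relying on anyone to remember to look for it.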

Go Beyond Prevention

Defect and outage prevention is only one side of the quality coin, though it's usually the side people naturally reach for when trying to figure out how to handle defects better in the future. You should of course investigate better prevention measures, but you should also consider solutions that will improve your situation when defects do occur, because failures will always happen.

I personally think it's entirely unrealistic for all your measures to be preventative. A focus entirely on preventative measures has a tendency to slow down your team's ability to deliver while at the same time not delivering the level of quality that you could.

With that said, here are a few classes of mitigating measures:

Improve your "Time to Repair"

There's an interesting metric called MTTR, which stands for "Mean Time to Repair/Recovery" and is basically the average time it takes you to fix a defect/outage. It's an important metric, because the cost of a defect must include how long that defect was affecting customers. The speed at which you can deliver a fix is going to be a major factor in defect mitigation. You'll want to ask questions like:

How can we pinpoint problems faster?
How can we create fixes faster?
How can we verify our fixes faster?
How can we deliver fixes faster?
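For reference, computing MTTR itself is just an average over incident durations. Here is a minimal sketch, assuming you record a start and a resolution time for each incident; the incident log below is entirely made up:

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average duration of (started_at, resolved_at) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - started for started, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log, for illustration only.
incidents = [
    (datetime(2024, 1, 5, 22, 0), datetime(2024, 1, 5, 23, 30)),  # 90 minutes
    (datetime(2024, 2, 10, 9, 0), datetime(2024, 2, 10, 9, 30)),  # 30 minutes
]
mttr = mean_time_to_repair(incidents)
print(mttr)
```

Tracking this number over time gives the team a way to see whether the answers to the questions above are actually paying off.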

Practices like Continuous Delivery can help here greatly. If you have a 20 minute manual deployment that involves a number of coordinated activities from a number of team members, you will be leaving customers exposed for much longer than a team practicing Continuous Delivery.

Automated testing on its own can be a huge help. If the bulk of your tests is manual, then a fix will take some time to verify (including verifying that it doesn't break anything else). Teams that rely heavily on manual testing will usually test much less thoroughly on a "hot fix", which occasionally can lead to worsening of the situation.

In my experience though, nothing affects MTTR as much as the speed at which you can detect defects/outages...

Improve Detection Time

Talk about how quickly you discovered the issue compared to when it was likely to have started. If your customers are discovering your issues, try figuring out if there's a way that you can beat them to it. Instrumentation (metrics & logging) has been a huge help for me in different organizations for knowing about problems before customers can report them. Information radiators can help keep those metrics ever-present and always on the minds of the team members.

Threshold-based alerting systems (that proactively reach out to the team to tell them about issues) in particular are valuable because they don't rely on the team to check metrics themselves, and they can short circuit that "human polling loop" and alert the team much faster, or during times that they ordinarily would not be looking (at night, on weekends, etc). It's pretty easy to see that an alerting system that alerts about an outage on a Friday night can save the customer days of exposure.
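The core of a threshold-based alert is a small comparison that runs on a schedule. This sketch assumes per-minute error and request counts are already available from your metrics system; the function name, the 5% threshold, and the minimum-traffic guard are illustrative choices, not recommendations:

```python
def should_alert(error_counts, request_counts, threshold=0.05, min_requests=100):
    """Return True when the error rate over the window exceeds the threshold.

    `error_counts` and `request_counts` hold one number per interval
    (for example, per minute) for the window being evaluated.
    """
    total_requests = sum(request_counts)
    if total_requests < min_requests:
        # Too little traffic to judge a rate; avoids paging on 1 error in 2 requests.
        return False
    return sum(error_counts) / total_requests > threshold


# Hypothetical three-minute windows.
quiet = should_alert([1, 2, 1], [100, 100, 100])     # about 1.3% errors
broken = should_alert([10, 20, 30], [100, 100, 100])  # 20% errors
print(quiet, broken)
```

Note that the minimum-traffic guard means a total loss of traffic would not trigger this particular check, so an outage that stops requests from arriving at all needs its own alert.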

Lessen the Impact of Failures

If you can figure out ways for failures to have less impact, that's a huge win as well. Here are a few examples of ideas I've seen come out of one of these meetings:

Have deployments go to only a 5% segment of users first for monitoring before going out to 100% of the users.
Speed up builds and deployments, so hot-fixes can go out faster.
Have an easy way to display a message to users during an outage.
Improve metrics and logging to speed-up debugging around that particular issue.
Set-up off-hours alerting.
Have a one-click "revert deployment" mechanism that can instantly revert to a previous deployment in case something goes wrong.
Create "bulkheads/partitions" in your applications so that if one part fails, the rest can still function properly. There are many common examples of this in software, including PHP's request partitioning model, or the browser's ability to continue despite a javascript exception, even on the same page. Service-oriented architectures often have this quality as well.

You may or may not need some creativity here to come up with your own, but it's worth the effort.
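As one example, the "5% of users first" idea from the list above is often implemented by hashing a stable user identifier into a bucket. This is a sketch under that assumption; the `in_rollout` name and the salt value are hypothetical, and a real system would also need somewhere to store the current percentage:

```python
import hashlib


def in_rollout(user_id, percent, salt="release-salt"):
    """Deterministically place a user in or out of a percentage rollout.

    The same user always lands in the same bucket for a given salt, so
    they don't flip between old and new versions from request to request.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


# Roughly 5% of a hypothetical user population gets the new deployment.
exposed = sum(in_rollout(f"user-{n}", 5) for n in range(10_000))
print(exposed)
```

Raising `percent` only ever adds users to the exposed group, and changing the salt for each release keeps the same unlucky users from always being first in line.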

Be Realistic with Plans For Improvement

Whatever you say you will do as a result of this meeting, make sure that it's actually realistic, and that there's a realistic plan to get it into the team's future work (e.g. Who will do it? When?). The best way to have completely useless meetings is to not actually do what you plan to do.

Write Up a Report for the Rest of the Company

The report should say honestly how bad the problem was, in what way (and for how long) customers were affected, and generally what events led up to it (blamelessly!). Additionally you'll want to declare the next steps that the team plans to take, so that the company knows you're a professional engineering team that cares about results as much as the other people in the organization. You should be ready and willing to take questions and comments on this report, and send it to as many interested parties as possible. Often other people in the company will have additional ideas or information, and the transparency makes them feel like these are welcome any time.

The real value in this is that you show that the entire team is uniformly holding itself accountable for the problem, and that any propensity that the rest of the organization has for blaming a single person is not in accordance with the engineering team's views. The engineering team and management should be willing and able to defend any individuals targeted for blame.

Decide What Types of Defects/Outages Necessitate These Meetings

Some organizations are more meeting-tolerant than others, so there's no hard-and-fast rule here. If you had one of these meetings for every production defect, you'd probably very quickly have a bunch of solutions in place that greatly reduce the number of defects though (and therefore reduce the number of these meetings!). These meetings are all investments. The more you have, the more they start to pay off, both with quality of product and with speed of delivery (if you stay conscious of that!).

One thing I will recommend though is that you look for recurrences and patterns in these defects/outages. The team will usually benefit disproportionately from more time invested in solving repeated problems.

Tags: quality, psychological safety, blamelessness, post-mortem, practices, defects
