I often think about what exactly it is about experience that makes a software developer better. Are there aspects of experience that are teachable, but that we don't yet understand well enough to teach? Can we do better at passing on "experience", rather than having every developer suffer through the same mistakes that we suffered through?

I know I've had a couple of hard-won lessons over the years that really helped me be more successful in software engineering. I've been able to spot the warning signs of certain mistakes for a long time, but only recently have I figured out the reasons behind those warning signs. And I think some of those reasons can be explained mostly in terms of probability math, and math can be taught, right?

Before I go on, I should preface this by saying I've failed many, many math classes. I hardly ever use advanced math in my work, and I doubt many other programmers do. I wish I'd had more patience and learned more math (especially how to apply it) too. So with that said, here goes...

The first piece of experience I'd like to pass on is that your "flawless" solution to a problem, at scale, over time, will fail.

I don't know how many times I've heard an engineer say something like "This solution is basically bulletproof. There's only a 1 in a million chance that that corner case you just mentioned will occur", and then promptly put the solution into an environment that does 1 billion transactions a month.

Here's how the math looks:

```
1B transactions/month * 1/1M probability of failure per transaction
= 1B / 1M failures per month
= 1000 failures per month.
```

Yes, the mathematics of probability are telling us that that particular solution will fail roughly 1000 times a month. This is probably a "duh" conclusion for anyone with a computer science degree, but I see developers (yes even ones with computer science degrees) failing to apply this to their work all the time.
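That back-of-the-envelope calculation is easy to script. Here's a minimal sketch, using the hypothetical volume and failure rate from the example above:

```python
# Expected failures = transaction volume * per-transaction failure probability
transactions_per_month = 1_000_000_000  # "1 billion transactions a month"
p_failure = 1 / 1_000_000               # the "1 in a million" corner case

expected_failures_per_month = transactions_per_month * p_failure
print(expected_failures_per_month)  # 1000.0
```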

At scale and over time, pretty much everything converges on failure. My favourite thing to tell people is: "At this scale, our users could find themselves inside an `if (false) {` statement."

So what does this mean? It doesn't mean everything you do has to be perfect. The downsides of a failure could be extremely small or completely acceptable (in fact, pursuing perfection can get really expensive really quickly). What it does tell you is how often to expect failure. For any successful piece of software, you should expect it often. Failure is not an exceptional case. You have to build observable systems so that you know about the failures. You have to build systems that are resilient to failure.
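One way to see the "everything converges on failure" claim is to compute the chance that a rare bug bites at least once as volume grows. A minimal sketch, reusing the hypothetical one-in-a-million rate from earlier:

```python
def p_at_least_one_failure(p: float, n: int) -> float:
    """Chance of at least one failure in n independent tries,
    each failing with probability p."""
    return 1 - (1 - p) ** n

# A "1 in a million" corner case, hit with increasing volume:
for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} transactions -> {p_at_least_one_failure(1e-6, n):.4f}")
# As n grows, the probability climbs towards 1.0: at a billion
# transactions, the "impossible" case is a near-certainty.
```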

Often I hear people talking about solution designs that have high downsides in the case of failure with no back-up plan and defending them with "But what could go wrong?". This is a tough one for an experienced developer to answer, because having experience doesn't mean that you can see the future. In this case all the experienced developer knows is that something will go wrong. Indeed when I'm in these conversations, I can sometimes even find one or two things that can go wrong that the other developer hadn't considered and their reaction is usually something like "Yeah, that's true, but now that you mentioned those cases I can solve for them. I guess we're bulletproof now, right?".

The tough thing about unknown unknowns is that they're unknown. You're never bulletproof. Design better solutions by expecting failure.

Understanding this lesson is where ideas like chaos monkey, crash-only architecture, and blameless post mortems come from. You can learn it from the math above, or you can learn it the hard way like I did.

Here's the second piece of mathematically-based wisdom that I also learned the hard way: if you have a system with multiple parts relying on one another (basically the definition of a system, and of any computer program ever written), then the success rate of the system is the *product* of the success rates of the individual components -- which means the failures compound.

Here's an almost believable example: let's pretend that you've just released a mobile app and you're seeing a 0.5% crash rate (let's be wildly unrealistic for simplicity and pretend that all the bugs manifest as crashes). That means you're 99.5% problem-free, right? Well, what if I told you the backend also only succeeds 99.5% of the time, and it's either not reporting its errors properly or your mobile app is not checking for those error scenarios properly?

Probability says that you compute the total probability of success by multiplying the two probabilities, i.e.:

```
99.5% × 99.5% ≈ 99%
```

What if you're running that backend on an ISP that's got 99.5% uptime? And your load balancer has a 99.5% success rate? And the user's wifi has 99.5% uptime? And your database has a 99.5% success rate?

```
99.5% × 99.5% × 99.5% × 99.5% × 99.5% × 99.5% ≈ 97%
```

Now you've got a 3% error rate! 3 out of every 100 requests fail. If one of your users makes 50 requests in a session, there's a good chance (roughly 78%) that at least one of them will fail. You had all those 99's and still an unhappy user, because you've got multiple components and the combined success rate is the product of each component's success rate.
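Here's the compounding written out as a sketch, using the same invented 99.5% figure for every component:

```python
# System success rate = product of each component's success rate.
component_success_rates = {
    "mobile app": 0.995,
    "backend": 0.995,
    "ISP": 0.995,
    "load balancer": 0.995,
    "user's wifi": 0.995,
    "database": 0.995,
}

system_success = 1.0
for rate in component_success_rates.values():
    system_success *= rate

print(f"system success rate: {system_success:.4f}")  # ~0.9704, i.e. a ~3% error rate

# Chance that a 50-request session hits at least one failure:
p_unhappy_session = 1 - system_success ** 50
print(f"unhappy session:     {p_unhappy_session:.2f}")  # ~0.78
```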

This is why it matters so much when ISPs and platform-as-a-service providers talk about uptime in "five nines", or 99.999%. They know they're just one part of your entire system, and every component you add has a failure rate that compounds with their baseline failure rate. Your user doesn't care about the success rate of any individual component -- the user cares about the success rate of the system as a whole.

If there's anything at all that you should take from this, it's that the more parts your system has, the harder it is to keep a high rate of success. My experience bears this out too; I don't know how many times I've simplified a system (by removing superfluous components) only to see the error rate drop for free, with no specific additional effort.

We saw in the last lesson one example of how surprising complexity can be. The inexperienced developer is generally unaware of that and will naively try to predict the future based on their best understanding of a system, but the stuff we work on is often complex and so frequently defies prediction. Often it's just better to use past data to predict the future instead of trying to reason about it.

Let's say you're working on a really tough bug that only reproduces in production. You've made 3 attempts (and deployments) to fix it, all of which you thought would work, but which either didn't or just resulted in a new issue. On your next attempt, you could believe you've absolutely got it this time and give yourself a 100% chance of success, like you did the other 3 times. Or you could call your command of the problem into question entirely and assume a 50/50 chance, because there are two possible outcomes: success or heartache. I personally treat high-complexity situations like low-data situations, though. If we accept that we don't have enough information to reason our way to the answer (and we've got 3 failed attempts proving that that's the case), we can use Bayesian inference to get a more realistic probability. Bayesian inference, and more specifically Laplace's Law (the rule of succession), tells us to consider the past. Since we've had 0 successes so far in 3 attempts, the probability that the next deployment succeeds is:

```
= (successes + 1) / (attempts + 2)
= (0 + 1) / (3 + 2)
= 1 / 5
= 20%
```

This is the statistical approach I use for my own predictions, too. With no additional information, if I've failed at something a number of times, the chance that I'll succeed in the future is reduced. I don't use this data in a depressingly fatalistic way though -- I use it to tell me when it's time to make more drastic changes in my approach. Something with a 20% chance of success probably needs a much different, more powerful approach. I also stop telling people "I'm sure I've definitely got it this time".
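Laplace's rule of succession is one line of code. A minimal sketch:

```python
def rule_of_succession(successes: int, attempts: int) -> float:
    """Laplace's rule of succession: estimated probability that the
    next attempt succeeds, given the track record so far."""
    return (successes + 1) / (attempts + 2)

print(rule_of_succession(0, 3))  # 0.2 -- three failed fixes: 20% for the next one
print(rule_of_succession(0, 0))  # 0.5 -- no data at all: a 50/50 prior
```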

Similarly, if an area of the codebase has received a comparatively large number of bug-fixes, I'm going to lean towards it being more likely to still have bugs than the areas that have received fewer. This may seem obvious now, but if you're the person who did all those bugfixes, you may be biased towards believing that after all those fixes, it must be less likely to still be buggy. I'd happily bet even money against that, and I've actually won a lot of money that way.

I think it's fair to say that this is extremely counter-intuitive and may on its face look like a form of the Gambler's fallacy, but remember it's for use in low-information scenarios, and I would say that numerous defects are clear evidence you didn't have enough information about the system to predict this bug in the first place.

Relatedly, Extreme Programming has a simplified statistical method of estimating how much work a team can handle in a sprint called "Yesterday's Weather". Instead of having a huge complicated formula for how much work a team can handle in a given timeframe, they simply look at what the team was able to accomplish last sprint. It's going to be wrong a lot of course, but so is whatever huge complicated formula you devise.
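Yesterday's Weather barely needs code, which is rather the point. A sketch (the sprint history here is invented):

```python
# "Yesterday's Weather": forecast next sprint's capacity from what the
# team actually finished last sprint -- no complicated formula.
def yesterdays_weather(completed_per_sprint: list[int]) -> int:
    return completed_per_sprint[-1]

history = [21, 18, 25, 19]  # story points completed in past sprints (made up)
print(yesterdays_weather(history))  # 19
```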

If there's a more generalized lesson that you should take from this, it's that we work with complex systems, they're notoriously difficult for humans to predict, and predicting statistically can often get you a clearer picture of the reality of a situation. Resist prediction from first principles or through reasoning. You don't have enough data, and software systems are normally elusively complex.

With all this said, I know it's pretty hard to really get a feel for the impact of probabilities until you've had the prerequisite defects, failed deployments, and product outages. I sure wish someone had at least taken a shot on telling me in advance though. It also makes me wonder what I've still got left to learn.
