The Quality Feedback Loop

A lot of conversation about engineering management and product development process rests on a fundamental assumption that product quality and development speed are always opposing forces. Time and again, however, I find myself learning and relearning that speed and quality can instead be symbiotic: improving one can also improve the other. These win-win scenarios are everywhere once you’re open to the fact that they’re possible.

One place where I think that this is obvious is the multiple feedback loops around quality. New features go through these loops over and over and so it’s hugely important to optimize them if you want to ship software quickly.

There is also a sometimes-disputed axiom in software development that bugs found sooner are cheaper and faster to fix. I do agree that the empirical data on the subject is a little light, but I think it follows naturally from the fact that longer feedback loops make systems harder to manage and reason about. For example, if your only quality measure for anything you do is customer reports and you do nothing else for quality whatsoever, solving issues is extremely time-consuming and error-prone.

So let’s start with that outer loop and list a bunch of other common quality feedback loops:

  • Customer reported bug
  • Logged error seen during regular inspection
  • Error alert happening when a defect occurs
  • Defect found by testers before launch
  • Broken CI build
  • Broken test suite
  • In-editor error

Customer reported bugs

These are of course the most expensive; the feedback loop is largest here. If you’re maintaining a large feature set, it’s possible that you don’t remember the details of how a feature works, or even that the software has that feature at all. These defects also come back at inopportune times and interrupt you while you’re working on other tasks. You’re likely also on a team, so the defect probably isn’t even one that you had any hand in. Lacking all of this context makes solving the defect all the more difficult and all the more time-consuming.

There’s the obvious cost to the user as well. Production defects can cost you customers, or in extreme cases can even lead to lawsuits.

Ultimately you want to find a way to shorten the feedback loop, which usually means moving detection of this kind of defect to an earlier feedback loop.

As long as the measure that tightens the feedback loop is cheaper effort-wise than the defect, you’ve got improved speed and quality. It almost always is cheaper though because these loops are run many times for each feature (though there are some common types of loops that are especially expensive, like manual human regression testing).

Logged errors and metrics seen by regular log inspection

Production logs are a critical part of making an application self-reporting. If you’re regularly checking your logs (and keeping the signal-to-noise ratio in them high), there’s a good chance of finding production defects before users do.

That’s great because it can sometimes catch things weeks or months before customer reports do, and that faster feedback loop means you’re more likely to remember the affected area of code.

Usually for these types of issues though, we can go one level deeper…

Error alert happening when a defect occurs

If you have a system set up where logged errors increment a metric, you can put an alert on a threshold for that metric. There are a bunch of services you can integrate for this functionality, or you can run your own Kibana service. The point is that your production systems can be self-reporting: they can tell you when there are problems, thus tightening the feedback loop further.
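As a rough illustration of the "logged errors increment a metric, and the metric has an alert threshold" idea, here is a minimal in-process sketch using Python’s standard logging module. The handler class, the "checkout" logger name and the notify callback are all hypothetical; a real setup would more likely ship the metric to a monitoring service and alert from there.

```python
import logging
import time
from collections import deque


class ErrorRateAlertHandler(logging.Handler):
    """Counts ERROR-level records and calls `notify` when too many arrive in a window."""

    def __init__(self, threshold, window_seconds, notify):
        super().__init__(level=logging.ERROR)
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.notify = notify  # hypothetical callback: pager, chat message, email...
        self._timestamps = deque()

    def emit(self, record):
        now = time.monotonic()
        self._timestamps.append(now)
        # Drop errors that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.threshold:
            self.notify(f"{len(self._timestamps)} errors in the last {self.window_seconds}s")
            self._timestamps.clear()  # reset so one burst produces one alert


alerts = []  # stands in for a real notification channel
logger = logging.getLogger("checkout")  # hypothetical logger name
logger.propagate = False
logger.addHandler(ErrorRateAlertHandler(threshold=3, window_seconds=60, notify=alerts.append))

for attempt in range(3):
    logger.error("payment failed (attempt %d)", attempt)
```

After the third error inside the window, the handler calls notify once and resets its count, so a burst of failures produces a single alert rather than one per log line.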

This is also super useful, because it really helps your mean-time-to-repair. Minimizing the amount of time it takes to find a defect in production also helps minimize the amount of time a user is affected by that defect.

Defect found by testers before launch

Unfortunately I think the most common way of finding defects is manual human inspection. It’s a natural choice of course, but it’s by far the slowest and most error-prone. It’s a valid method if you can’t solve your issues otherwise, but the repeated compounded cost including the time to test, and the way it affects your ability to quickly deliver software shouldn’t be ignored. When a good automated test is possible, it’ll be both faster and less error-prone. I work on a production system that has ~4500 automated tests that run dozens of times per day. Having humans do that is impossible.

With all that said, these are still defects that are found earlier than in production and so they save your customers from the defect, and they lead to a tighter feedback loop. It’s just that this feedback loop is so expensive that as a developer you really shouldn’t be relying on it the way you can rely on even tighter feedback loops.

Broken CI build

The first line of defence after your work leaves your machine is the CI build. Any quality measures you have in your build process (which I’ll get into shortly) should be part of this build, and they’ll verify that what you’ve got in the main branch is ready to move on to the next step. If the main branch doesn’t pass the same barrage of quality measures that the local machine build is supposed to, it certainly shouldn’t move past this step on its way to production. It could be that this is your last line of defence before affecting customers, or it could be that you have a human tester who can at least know not to bother testing a broken build. (Ideally, passing your quality measures is necessary for any build artifact to exist at all, so that testing a broken build isn’t even an option.)

Of course this is an easy saving for your users and any human testers you might have, but it’s still not the tightest feedback loop you can have. It’s also an expensive measure for your team-mates; it means there’s a period of time where the main branch is unusable, blocking their work.

CI builds really should be doing the same thing as a local developer machine’s build so that developers have a reasonable assurance that if they run the build locally and it works, it should pass in CI as well. Let’s talk about some of the quality measures that should go into a build.
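One way to keep the local build and the CI build identical is to give both a single entry point. The sketch below assumes a hypothetical run_checks helper that runs a list of commands and reports the first failure; the two commands used here are stand-ins for whatever linters and test runners a project actually uses.

```python
import subprocess
import sys


def run_checks(checks):
    """Run each command in order; return the first failing command, or None if all pass."""
    for command in checks:
        if subprocess.run(command).returncode != 0:
            return command
    return None


# Stand-in commands: one check that passes and one that fails.
passing = [sys.executable, "-c", "print('lint ok')"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

first_failure = run_checks([passing, failing])
if first_failure is not None:
    print("check failed:", " ".join(first_failure))
```

If developers and the CI server both call the same script with the same list, a green local run is a reasonable (though not perfect) predictor of a green CI run.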

Broken test suite

Automated tests can make up a feedback loop that is almost instant. On most platforms that I’ve worked on, testing is fast enough that a single test almost always takes less than a second. I work on a codebase with ~4500 automated tests that run in about 2 minutes (albeit due to herculean efforts at parallelization). The speed of these is super important because it makes the feedback loop short and helps prevent developers from relying on CI as a personal build machine.

Comprehensive test suites are expensive! We spend a lot of time maintaining ours, adding to it, and ensuring it stays fast. It’s almost certainly our most effective quality measure too, though.

Integration tests tend to be faster to write, because they test more things at once, but when they do fail, you’ve usually got some extensive debugging to do. Unit tests tend to take more time to write if you want the same level of coverage, but when they fail, you usually know exactly where the issue is. These are things to factor into your feedback loop considerations.
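To make the trade-off concrete, here is a small sketch with Python’s unittest. The parse_price and order_total functions are invented for illustration. A failure in the unit tests points straight at parse_price, while a failure in the integration-style test could come from either function.

```python
import unittest


def parse_price(text):
    """Convert a price string such as "$4.50" to integer cents."""
    dollars, _, cents = text.strip().lstrip("$").partition(".")
    return int(dollars) * 100 + int(cents.ljust(2, "0")[:2])


def order_total(price_strings):
    """Sum a list of price strings and return the total in cents."""
    return sum(parse_price(p) for p in price_strings)


class ParsePriceUnitTest(unittest.TestCase):
    # Unit tests: a failure here points straight at parse_price.
    def test_dollars_and_cents(self):
        self.assertEqual(parse_price("$4.50"), 450)

    def test_whole_dollars(self):
        self.assertEqual(parse_price("$4"), 400)


class OrderTotalIntegrationTest(unittest.TestCase):
    # Integration-style test: covers parsing and summing together,
    # so a failure needs more digging to locate.
    def test_total(self):
        self.assertEqual(order_total(["$4.50", "$0.25", "$10"]), 1475)


loader = unittest.defaultTestLoader
suite = unittest.TestSuite([
    loader.loadTestsFromTestCase(ParsePriceUnitTest),
    loader.loadTestsFromTestCase(OrderTotalIntegrationTest),
])
result = unittest.TextTestRunner(verbosity=0).run(suite)
```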

There are still tighter feedback loops that are cheaper to maintain though and those should be relied on where possible.

In-editor error

Any type of static analysis that can be performed in your editor/IDE, like linting or static type-checking, is an even tighter feedback loop still. Any problems are evident instantly while you’re still in the code, and the tool shows you exactly where they are. This is an extremely fast feedback loop that you’ll probably want to employ wherever it’s possible and makes sense.
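As a small sketch of what static type-checking catches, assume a Python codebase with type hints and a checker such as mypy or pyright wired into the editor. The function names and the coupon table below are made up for illustration.

```python
from typing import Optional


def find_discount(code: str) -> Optional[float]:
    """Return the discount rate for a coupon code, or None if the code is unknown."""
    rates = {"SPRING": 0.10, "VIP": 0.25}  # hypothetical coupon table
    return rates.get(code)


def apply_discount(price: float, code: str) -> float:
    rate = find_discount(code)
    # Writing `return price * (1 - rate)` here without the check below is the kind
    # of mistake a type checker flags in the editor, because `rate` may be None.
    if rate is None:
        return price
    return price * (1 - rate)


print(apply_discount(100.0, "VIP"))
print(apply_discount(100.0, "UNKNOWN"))
```

Without the None check, the bug would otherwise only surface at runtime, the first time someone entered an unknown coupon code.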

I don’t have a tighter feedback loop than this, but in some cases you can still do better…

Abstraction that simply makes the error impossible

If it’s possible to use tools/abstractions that make the defect impossible, that beats all the feedback loops.

Some examples:

  • Avoid off-by-one errors in for loops by using iterator functions, or functional paradigms that give you filter(), map() and reduce().
  • Avoid SQL injection by using prepared statements.
  • Avoid Cross-site scripting attacks by using an html template language that automatically escapes output.
  • Avoid bugs from unexpected changes in shared objects by using immutable data structures.
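Three of the examples above can be sketched with nothing but the Python standard library. The data and the Settings class are invented for illustration, and html.escape only shows the escaping step that an auto-escaping template language would apply for you on every output.

```python
import html
from dataclasses import FrozenInstanceError, dataclass

prices = [5, 12, 8, 20]

# No index arithmetic, so there is no off-by-one error to make.
expensive = list(filter(lambda p: p >= 10, prices))
doubled = list(map(lambda p: p * 2, prices))

# Escaping turns markup in user input into inert text.
comment = '<script>alert("hi")</script>'
safe_comment = html.escape(comment)


@dataclass(frozen=True)
class Settings:
    """An immutable value object: attributes cannot be reassigned after creation."""
    retries: int


shared = Settings(retries=3)
try:
    shared.retries = 0  # an accidental change to a shared object
    mutated = True
except FrozenInstanceError:
    mutated = False  # the change is rejected instead of silently spreading
```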

Working through the levels

These levels all form a sort of “onion” of quality feedback loops where the closer you get to the middle, the cheaper the defect is.

Thinking this way, you can easily see how, if your users are reporting an issue caused by a SQL injection attack, you would ideally work to push that problem to tighter and tighter feedback loops where possible. If you can make it show up in logs or alerts, you can fix it before users report it. If you can have testers test for it, you can fix it before users are subjected to it. If you can write some unit tests for it, you can save your testers from having to bother. If you can use the right level of abstraction (prepared statements / parameterized queries in this case), you can eliminate the class of error entirely.
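For the SQL injection case, the innermost layer might look like the sketch below, which uses Python’s built-in sqlite3 module with an in-memory database. The table and the malicious input are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 1), ("bob", 0)])

malicious = "nobody' OR '1'='1"

# Vulnerable: the input is spliced into the SQL text and changes the query's meaning.
unsafe_rows = conn.execute(
    f"SELECT name FROM users WHERE name = '{malicious}'"
).fetchall()

# Safe: the ? placeholder sends the input as data, never as SQL.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (malicious,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The string-built query matches every row in the table, while the parameterized one matches none, because no user is literally named "nobody' OR '1'='1".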

Delivering high-quality software quickly means looking at the most expensive, time-consuming or frequent classes of errors and systematically pushing them to an inner layer of this onion of quality feedback loops. With a little situational awareness and a little creativity it’s almost always possible, and it leads to huge cost and time savings over the long haul.

This is just one of the many ways that I think the speed vs quality dichotomy in software engineering is a false one.
