Special thanks to my friends Deepa Joshi and Sam DeCesare for input/feedback!

The 80/20 Rule, otherwise known as the Pareto Principle, has wide-reaching implications for productivity. If you can get 80% of the results you want for 20% of the effort in a given endeavour, you can make an efficiency/thoroughness trade-off that's right for your desired results. When you find a scenario that conforms to the 80/20 rule, there's a real opportunity to save a whole lot of effort and still rake in big results. Unfortunately, a bunch of scenarios in life don't conform to this nice 80/20 distribution, so I'd like to propose another principle that I see pretty often.

The All-or-Nothing Principle: For many scenarios in life, there are 0% of the effects until there are 100% of the causes.

So while a Pareto distribution looks something like this:

…an All-or-Nothing distribution is much simpler; you’ve either got the effect or you don’t.

It applies only to systems where results are binary, and those can be found all over the place: everywhere from the electronic, like light switches or mouse clicks, to the more biological, like being pregnant or being dead.

Like the Pareto Principle, the All-or-Nothing Principle can have a huge effect on productivity if you recognize it and handle it appropriately.


Here’s a simple example: Bridges are expensive to build. Building a bridge even 99% of the way across a river gets you 0% of the results. This isn’t at all the type of distribution that the Pareto Principle describes and so you can’t make the same efficiency/thoroughness trade-off with it. You simply can’t get 80% of the cars across with 20% of a bridge, so in order to extract maximum productivity from these scenarios, you have to approach the situation fundamentally differently. The important thing to remember in these scenarios (assuming you’ve identified one correctly) is that there is no difference at all in effect between 0% effort and 99% effort. If all you can give is 99% effort, you should instead give no effort at all.

Let's talk through the bridge example a little more: if you were a bridge-building company with 99% of the resources and time needed to build a bridge, you'd be in a much better spot in general BEFORE you started at all than you'd be at the 99% mark, even though the 99% mark is so much closer to completion. In fact, the almost-complete state is a sort of worst-case scenario: you've spent the most money and time possible and reaped the least benefit. Where 80/20 distributions reap quite a lot of reward for very little effort upfront, all-or-nothing scenarios reap no reward whatsoever until effort crosses a specific threshold.

There are lots of examples of this in the real world too. Medical triage sometimes considers this trade-off, especially when resources are constrained, as in wars or other scenarios with massive casualties. Medical professionals will prioritize spending their time not necessarily on the people closest to death, but on the people close to death who have a reasonable chance of being saved. Any particular person's survival is an all-or-nothing proposition. It's a brutal trade-off, but of course the medical professionals know that if they don't think they can save a person, they should instead spend their efforts on the other urgent cases that have a better chance; the goal, after all, is to maximize lives saved. In the real world, probabilities about outcomes are never certain either, so I'm sure this makes these trade-offs even more harrowing. (Interestingly, if a triage system like this allows 80% of the affected population to survive by prioritizing 20% of them, the effort on the population as a whole conforms to the Pareto Principle.)

The All-or-Nothing Principle and Software Development

Fortunately for me, I work in software development, where being inefficient (at least in the product areas I work in) doesn't cost anyone their life. All-or-nothing scenarios do show up in dozens of important places in software development though.

Of all the scenarios, I don’t think any has shaped the practices and processes of software development in the last 25 years as much as the all-or-nothing scenario of “shipping” or “delivering” product to actual users. The reason for that is that software only gets results when it’s actually put in front of users. It really doesn’t matter how much documenting, planning, development, or testing occurs if the software isn’t put in front of users.

This specific all-or-nothing scenario in software is particularly vicious because of gnarly problems on both sides of the “distribution”:

  1. Expectations of the effort required to get to a shippable state can be way off. You never really know until you actually ship.
  2. Expectations of the effectiveness of the software to meet our goals (customer value, learning, etc) can be way off. Will users like it? Will it work?

Where more linear effort-result distributions (including even 80/20 distributions) get you results and learning along the way, and a smoother feedback cycle on your tactics, all-or-nothing scenarios offer no such comforts. It's not until the software crosses the threshold into the production environment that you truly know your efforts were not in vain. With cost and expected value both so difficult to determine in software development, the prospects become that much riskier. These are the reasons that a core principle of the Agile Manifesto is that "working software is the primary measure of progress."

It should be no surprise then that one of the best solutions that we’ve come up with is to try to get clarity on both the cost and expected value as early as possible by shipping as early as possible. Ultimately shipping as early as possible requires shipping as little as possible, and when done repeatedly this becomes Continuous Delivery.

Pushing to cross the release threshold early and often leads to massive waste reduction: users get value delivered to them earlier, and the product development team gets learnings earlier that in turn lead to better, more valuable solutions.

All of this of course depends on how inexpensive releasing is. In the above scenario, we’re assuming it’s free, and it can be virtually free in SaaS environments with comprehensive automated test suites. In other scenarios, the cost of release weighs more heavily. For example, mobile app users probably don’t want to download updates more often than every 2-4 weeks. Even in those cases, you probably want a proxy for production release, like beta users that are comfortable with a faster release schedule.

We often find ways to release software to smaller segments of users (beta groups, feature toggles, etc.) so we can both reduce some of the upfront effort required (maybe the software can't yet handle full production load, isn't internationalized for all locales, or doesn't have full brand-compliant polish) and start seeing some value immediately, which lets us extrapolate what kind of value to ultimately expect. The bar of releasing to even one user has a surprising ability to unveil most of the unknown unknowns that must be confronted before releasing to all users, so it's a super-valuable way of converging on the actual cost of delivering to everyone. I've personally found these practices to be absolutely critical to setting proper expectations about the cost/value of a software project and to delivering reliably.
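As a rough illustration of the kind of mechanism involved, here's a minimal sketch of gating a feature to a beta group or a percentage rollout. It's written in Python with made-up names (BETA_USER_IDS, feature_enabled); a real system would keep this data in a database or a feature-flag service rather than hard-coded values.

```python
import hashlib

# Hypothetical beta group; in a real system this would live in a database
# or a feature-flag service, not a hard-coded set.
BETA_USER_IDS = {"user-17", "user-42"}

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for this feature,
    so the same user always gets the same answer as the rollout widens."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

def feature_enabled(user_id: str, feature: str, percent: int = 0) -> bool:
    # Beta users always see the feature; everyone else is gated by the
    # percentage rollout, which can be widened release by release.
    return user_id in BETA_USER_IDS or in_rollout(user_id, feature, percent)

print(feature_enabled("user-42", "new-checkout"))             # beta user: True
print(feature_enabled("user-99", "new-checkout", percent=5))  # in the 5% rollout?
```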

Furthermore, there is a lot more to learning early than just the knowledge gained. Learning early means you have the opportunity to actually change what you expected to deliver based on things you learned. It means that the final result of all this effort is actually closer to the ideal solution than the one that loaded more features into the release without learning along the way.

Navigating All-or-Nothing Scenarios

Again, the benefit of identifying All-or-Nothing scenarios is that there are particular ways of dealing with them that are often more effective than solutions that would be common with other types of scenarios (like 80/20 scenarios). I’ll try to dig into a few in more detail now.

Try to make them less all-or-nothing

We’ve seen from above how continuous delivery breaks up one all-or-nothing scenario into many, thereby reducing the risk and improving the flow of value.

More theoretically, it's best to try to change unforgiving all-or-nothing scenarios into more linear scenarios where possible. That completely theoretical bridge builder could consider building single-lane bridges, or cheaper, less heavy-duty bridges that restrict heavier traffic, or even entirely different river-crossing mechanisms. These are just examples for the sake of illustration though… it's probably pretty obvious by now that I know nothing about building bridges!

Software is much more malleable and forgiving than steel and concrete, so we software developers get a much better opportunity to start with much lighter (but shippable) initial passes and to iterate into the better solution later. Continuous delivery, the avoidance of big-bang releases, is an example of this.

Minimize everything except quality

In all-or-nothing scenarios, reducing the scope of the effort is the easiest way to minimize the risk and the ever-increasing waste. It can be tempting sometimes to consider also reducing quality efforts as part of reducing the scope, but I’ve personally found that reducing the quality unilaterally often results in not actually releasing the software as intended. The result is that you leave a lot of easy value on the table, don’t capture the proper learnings early, and set yourself up to be interrupted by your own defects later on as they ultimately need to be addressed. If you don’t really care about the quality of some detail of a feature, it’s almost always better to just cut that detail out.

Realize that time magnifies waste

For the types of work that need to be complete before the core value of an endeavour can be realized, there's not only a real cost to how long that endeavour remains incomplete, but the more complete it becomes without actually being completed, the more it costs. In classical Lean parlance, this is called "inventory". Like a retail shoe store holding a large unsold shoe inventory in a back storeroom (it turns out that selling a shoe is generally an all-or-nothing scenario too!), effort in an all-or-nothing scenario has costs of delay. A shoe store's costs of delay include paying rent for that back storeroom and risking buying shoes it can't sell. It's for this reason that Toyota employs a method called "just-in-time" production, the cornerstone of Lean manufacturing (which has coincidentally also had a huge influence on software development!).

In the case of software, larger releases mean more effort stays unrealized for longer. This results in less value for the user over time and less learning for the company. Imagine a company that releases only once a week, on Fridays: high-value features (or bug fixes) completed on Monday systematically wait four days before users can get any value from them. Any feature worth building has a real cost of delay.
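To make that arithmetic concrete, here's a back-of-the-envelope sketch (illustrative numbers only, assuming work finishes at a uniform rate between releases) of how the average wait shrinks as the release cadence tightens:

```python
# Average time a finished piece of work waits before users can touch it,
# assuming completions are spread evenly across the release interval.
def average_delay_days(release_interval_days: float) -> float:
    return release_interval_days / 2

for interval in (28, 7, 1):
    print(f"release every {interval:>2} day(s): finished work waits "
          f"~{average_delay_days(interval):.1f} days on average")
# release every 28 day(s): finished work waits ~14.0 days on average
# release every  7 day(s): finished work waits ~3.5 days on average
# release every  1 day(s): finished work waits ~0.5 days on average
```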

Similarly, more parallel work in progress at once means there are multiple growing inventories of unfinished work. If a team can instead collaborate to reduce the calendar time of the single most valuable piece of work, this waste is greatly reduced.

Stay on the critical path

In all-or-nothing scenarios, it's crucial to be well aware of the critical path: the sequence of things necessary to get you to the "all" side of the scenario. These are the things that determine calendar time, and therefore how quickly you learn from users and deliver value. In software engineering, this is often called the steel thread solution. If you find yourself prioritizing non-critical-path work over critical-path work (which is common if you're not ever-vigilant), you'd do well to stop and get yourself back on track.

Pick your battles!

If you’re faced with an all-or-nothing scenario, and you know or come to realize that you can’t make the 100% effort required, the most productive thing you can do is to not even try. Like the 99% complete bridge, your efforts will just be pure waste.

Sometimes this can be extremely hard. It's why the term death march exists for software projects and project management in general. People just have a really hard time admitting when a project cannot succeed, even when they know so intellectually. It can be particularly difficult when the project has already had effort spent on it, due to sunk cost fallacy, but that doesn't change the proper course of action.

If you find yourself in a truly unwinnable all-or-nothing scenario, it’s best to forgive yourself for not pursuing that project anymore and to put your efforts/resources somewhere with better prospects. It’s not quitting — it’s focusing.

Special thanks to my colleagues Sam Chaudhary, Lexi Ross and Stuart Reavley for feedback/criticism!

I'm not a management guru, and I've never taken a management class. I have done some management though, and a hell of a lot more coaching (which is often more powerful than traditional management). I've also worked with and for managers of varying degrees of effectiveness, and I've picked up a few things that I've seen work really well.

While I’ve got a lot of experience with management of technical systems, I’m really just going to talk about the management of people here as if it’s the only type of management. It certainly isn’t, but it’s important enough to attempt to discuss separately.

Management is sometimes a title, but it’s more important that you think of it as a responsibility. It’s the responsibility to be effective in the larger organization.

Let's break that down further: being effective means achieving the desired results for a given endeavour. When you're not a manager, you can often get away with just trying hard and waiting for someone else to correct your course when the results aren't sufficient (though I hope it becomes obvious later that I don't recommend this). Conversely, managing requires watching results tirelessly and changing tactics that aren't working until they do.

Congratulations, You’re Hired!

You're already in a management position: no matter who you are, you're in charge of managing yourself, and the world is constantly deciding how it values you based on your ability to be an effective person. Even if you were to become a hermit living on a remote island somewhere, you'd have to constantly consider how effectively you spend your day just to survive. When you see your shoelace untied and you tie it back up, that's self-management. Before you can manage endeavours that involve other people, you'll want to make sure you're reasonably good at managing yourself.

Managing Without Being a Manager

A team can successfully be self-managing in varying degrees, or a team can be managed by a single individual. My preference has always been to lean as hard on the team as possible to be self-managing but regardless, “management” as a responsibility (ensuring effectiveness) must happen.

It is also possible to manage sideways and to manage up. This means that you are ensuring that management is happening regardless of your title. At any company I’ve seen, people that do this successfully (they’re effective and not an asshole) are the natural choices for management roles, and they get them much more often. You have to constantly pick high-value problems, and work within your organization to solve them.

The easiest way to start trying to manage up is to be part of your own solutions. When you find a problem that you can’t solve yourself and want to bring it to your manager, bring some potential solutions too.

Part of managing is finding your own high-value problems to solve. More commonly this is called “taking initiative”. If you’re waiting for someone else to give you the opportunity to “take initiative”, you’ve got a fundamental misunderstanding of what “taking initiative” means.

People that “want to be a manager” without having previously managed sideways or up are continuously overlooked for that role. If you can’t be successful at this without a title, you’re probably not going to be successful at it with a title either. Management isn’t for everyone; it often means a bunch of administration, bureaucracy, and conflict.

7 Core Principles

There are 7 principles that I think are important to pursue constantly and relentlessly to achieve effective management. Theoretically none of these are actually necessary for delivering results, but I’ve never seen a team consistently deliver results without them.

Integrity

Help the team identify and adhere to team values and goals.

  • Do what you say and say what you do. Relentlessly change what you’re doing to match what you’re saying you’re doing, or change what you’re saying you’re doing to match what you’re doing.

  • Be trustworthy and internally consistent.

  • Ensure that the entire team is acting as a cohesive unit.

Reliability

Reliability means rejecting responsibilities that you can’t be successful at.

  • When you find you’ve got a responsibility that you can’t fulfill, raise the issue externally immediately.
  • If you find this is happening repeatedly, figure out why and stop it.
  • When you find that the organization is exerting pressures on the team that are counter-productive, defend the team by fighting to create space for it to be productive and happy.

Alignment

Ensure the company and product vision, mission, and the strategy to achieve them are unambiguous, well known and widespread.

Work tirelessly to ensure buy-in from team-members, whether that involves changing minds or changing goals.

Two important tips for easier buy-in:

  1. Make sure people know WHY the goals are what they are. Even if they’re happy to work on any goals, the reasoning behind them can make them much more effective.
  2. Getting people’s input on goals is the easiest way to get buy-in, and a great way to get better goals.

Transparency

Ensure progress and status on the goals are well known and widespread.

  • ALWAYS keep interested parties up-to-date. Saying “I have no update” is almost always a crucial update.
  • Don’t be surprising. ALWAYS ask yourself “Would the current state of our progress be a surprise to interested parties?”.
  • Favour processes, practices, systems where transparency is built in, so it is less tedious (Nobody wants to do manual status reports, and that dislike is often an impediment to achieving transparency).
  • Make failures or difficulties transparent to the larger organization, including your efforts to solve them.
  • Effective communication is the key. Ensuring that the audience has actually understood your message, as easily as possible, is critical. Do not waste your time writing documents/memos that people don't read or don't understand.

Continuous Improvement

Support continuous improvement in processes and practices.

  • Regularly revisit the current practices and processes and look for ways to improve.
  • Throw out practices and processes that cost more than you get from them.
  • Revisit practices and processes after large projects or new problems.
  • Consider ways to double down on successes and reduce failures.
  • Consider not only prevention of failures, but also mitigation of failures when they occur anyway.
  • Find ways to measure success and failure and make it more constantly obvious to everyone so they can adjust faster.
  • In general “try harder next time” is a recipe for repeated failure. Think of ways to try differently.

Empowerment

Ensure that the team is set up for success and motivated.

  • Often this includes absolutely crucial human issues like staffing, team composition, interpersonal conflicts, and individual compensation.
  • Happy, well-supported teams are effective teams. Continuously poll the team to find out how to better support them.
  • You’ll need social and emotional intelligence to tackle this, and there’s no shortcut around it.
  • People are set up for failure when their responsibility in a given situation exceeds their ability and control in that situation. Root out these situations and rectify them immediately.
  • Be gracious, thankful, and celebratory where warranted. Genuine appreciation is a powerful motivator.

Autonomy

Avoid being a bottleneck by delegating all possible responsibilities back to the teams.

  • It's exceedingly difficult for one person to beat the ideas and efforts of an entire team, so it's foolish not to promote as much shared responsibility and autonomy as the team's capabilities allow. Coaching (as opposed to command-and-control management) is the extreme logical conclusion of this.

  • Fix things within your realm of capability. Raise all other issues (ideally with possible solutions) where possible.

And with all that said…

Usually when I’ve been ineffective (which occurs to some degree on a daily basis), I’ve found that it’s because I’ve failed to achieve one of the above principles. I still struggle with all of these quite regularly myself, and I’m not sure there’s ever a point where they’re “mastered”.

Most of these principles can and should be applied to sideways or upward management as well. Also, see how many you can apply to yourself; at the very least you’ll be a better teammate.

Ultimately there are no points awarded for effort. Effectiveness requires actual results, so even if these principles aren’t getting success for you, it’s your responsibility to throw them out and come up with better ones.

A lot of conversation about engineering management and product development process is based on a fundamental assumption that product quality and development speed are always opposing forces. Time and again, however, I find myself learning and relearning that speed and quality can instead be symbiotic: improving one can also improve the other. These win-win scenarios are everywhere, as long as your mind is open to the fact that they're possible.

One place where I think that this is obvious is the multiple feedback loops around quality. New features go through these loops over and over and so it’s hugely important to optimize them if you want to ship software quickly.

There's also an often-debated axiom in software development that bugs found sooner are the cheapest and fastest to fix. I'll admit the empirical data on the subject is a little light, but I think it follows naturally from the fact that longer feedback loops make systems harder to manage and reason about. For example, if your only quality measure for anything you do is customer reports and you do nothing else for quality whatsoever, solving issues is extremely time-consuming and error-prone.

So let's start with that outermost loop and list a bunch of other common quality feedback loops, from slowest to fastest:

  • Customer reported bug
  • Logged error seen during regular inspection
  • Error alert happening when a defect occurs
  • Defect found by testers before launch
  • Broken CI build
  • Broken test suite
  • In-editor error

Customer reported bugs

These are of course the most expensive; the feedback loop is longest here. If you're maintaining a large feature set, it's possible that you don't remember the details of how a feature works, or even that the software has that feature at all. These defects also come back at inopportune times and interrupt you while you're working on other tasks. You're likely also on a team, so the defect probably isn't even one you had any hand in. Lacking all of this context makes solving the defect all the more difficult and time-consuming.

There’s the obvious cost to the user as well. Production defects can cost you customers, or in extreme cases can even lead to lawsuits.

Ultimately you want to find a way to shorten the feedback loop, which usually means trying to move the problem into an earlier, tighter feedback loop.

As long as the measure that tightens the feedback loop is cheaper effort-wise than the defect, you’ve got improved speed and quality. It almost always is cheaper though because these loops are run many times for each feature (though there are some common types of loops that are especially expensive, like manual human regression testing).

Logged errors and metrics seen by regular log inspection

Production logs are a critical part of making an application self-reporting. If you're regularly checking your logs (and keeping their signal-to-noise ratio high), there's a good chance of finding production defects before users do.

That's great because it can sometimes catch things weeks or months before customer reports would, and that faster feedback loop means you're more likely to remember the affected area of code.

Usually for these types of issues though, we can go one level deeper…

Error alert happening when a defect occurs

If you have a system set up where logged errors increment a metric, you can put an alert on a threshold for that metric. There are a bunch of services you can integrate for this functionality, or you can run your own Kibana service. The point is that your production systems can be self-reporting: they can tell you when there are problems, thus tightening the feedback loop further.
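As a sketch of the idea, here's a minimal, standard-library-only version in Python. The page_oncall function is a made-up stand-in for whatever alerting integration you'd actually use, and the thresholds are arbitrary:

```python
import logging
import time
from collections import deque

def page_oncall(message: str) -> None:
    # Stand-in for a real alerting integration (paging service, chat webhook, etc.).
    print(f"ALERT: {message}")

class ErrorRateAlertHandler(logging.Handler):
    """Counts ERROR-level (and above) log records and alerts when more than
    `threshold` of them occur within `window_seconds`."""

    def __init__(self, threshold: int = 5, window_seconds: int = 60) -> None:
        super().__init__(level=logging.ERROR)
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def emit(self, record: logging.LogRecord) -> None:
        now = time.time()
        self.timestamps.append(now)
        # Drop error timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) > self.threshold:
            page_oncall(f"{len(self.timestamps)} errors in the last "
                        f"{self.window_seconds}s (latest: {record.getMessage()})")

# Attach to the root logger so every logged error feeds the same counter.
logging.getLogger().addHandler(ErrorRateAlertHandler())
```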

This is also super useful, because it really helps your mean-time-to-repair. Minimizing the amount of time it takes to find a defect in production also helps minimize the amount of time a user is affected by that defect.

Defect found by testers before launch

Unfortunately, I think the most common way of finding defects is manual human inspection. It's a natural choice of course, but it's by far the slowest and most error-prone. It's a valid method if you can't solve your issues otherwise, but the repeated, compounding cost (including the time to test and the way it affects your ability to deliver software quickly) shouldn't be ignored. When a good automated test is possible, it'll be both faster and less error-prone. I work on a production system that has ~4500 automated tests that run dozens of times per day. Having humans do that is impossible.

With all that said, these are still defects that are found earlier than in production and so they save your customers from the defect, and they lead to a tighter feedback loop. It’s just that this feedback loop is so expensive that as a developer you really shouldn’t be relying on it the way you can rely on even tighter feedback loops.

Broken CI build

The first line of defence after your work leaves your machine is the CI build. Any quality measures you have in your build process (which I'll get into shortly) should be part of this build, and they'll verify that what you've got in the main branch is ready to move on. If the main branch doesn't pass the same barrage of quality measures as the local machine build is supposed to, it certainly shouldn't move past this step on its way to production. It could be that this is your last line of defence before affecting customers, or it could be that you have a human tester who can at least know not to bother testing a broken build (ideally, passing your quality measures is necessary for any build artifact to exist at all, so that testing a broken build isn't even an option).

Of course this is an easy savings to your users and any human testers you might have, but it’s still not the tightest feedback loop you can have. It’s also an expensive measure to your team-mates; it means there’s a period of time where the main branch is unusable, blocking their work.

CI builds really should be doing the same thing as a local developer machine’s build so that developers have a reasonable assurance that if they run the build locally and it works, it should pass in CI as well. Let’s talk about some of the quality measures that should go into a build.
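One low-tech way to guarantee that parity is to put the quality gates in a single entry point that both developers and CI invoke. Here's a minimal sketch; the specific tools listed are placeholders for whatever your stack actually uses, not a recommendation:

```python
#!/usr/bin/env python3
"""build.py: run the same quality gates locally and in CI."""
import subprocess
import sys

# Each step is a command that must exit 0. CI runs exactly this script,
# so "it passed on my machine" and "it passed in CI" mean the same thing.
STEPS = [
    ["ruff", "check", "."],   # linting (example tool)
    ["mypy", "src"],          # static type checks (example tool)
    ["pytest", "-q"],         # the automated test suite (example tool)
]

def main() -> int:
    for step in STEPS:
        print("-->", " ".join(step))
        if subprocess.run(step).returncode != 0:
            print("FAILED:", " ".join(step))
            return 1
    print("All quality gates passed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```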

Broken test suite

Automated tests can make up a feedback loop that is almost instant. On most platforms I've worked on, testing is fast enough that a single test almost always takes less than a second. I work on a codebase whose ~4500 automated tests run in about 2 minutes (albeit due to herculean efforts at parallelization). The speed of these is super important because it keeps the feedback loop short and helps prevent developers from relying on CI as a personal build machine.

Comprehensive test suites are expensive! We spend a lot of time maintaining ours, adding to it, and ensuring it stays fast. It's also almost certainly our most effective quality measure.

Integration tests tend to be faster to write, because they test more things at once, but when they do fail, you’ve usually got some extensive debugging to do. Unit tests tend to take more time to write if you want the same level of coverage, but when they fail, you usually know exactly where the issue is. These are things to factor into your feedback loop considerations.
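As a toy example of that trade-off, here's a hypothetical pure function with two unit tests (using pytest purely for illustration). When one of these fails, it points at exactly one small behaviour; an integration-style test exercising the same logic through several layers would catch more at once but say much less about where the problem lives.

```python
import pytest

# A hypothetical pure function, not from any real codebase.
def apply_discount(price_cents: int, percent: int) -> int:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents - (price_cents * percent) // 100

def test_apply_discount_keeps_integer_cents():
    # 10% of 999 cents is 99.9 cents; integer math rounds the discount down to 99.
    assert apply_discount(999, 10) == 900

def test_apply_discount_rejects_nonsense_percentages():
    with pytest.raises(ValueError):
        apply_discount(1000, 150)
```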

There are still tighter feedback loops that are cheaper to maintain though and those should be relied on where possible.

In-editor error

Any type of static analysis that can be performed in your editor/IDE, like linting or static type-checking, is an even tighter feedback loop still. Problems are evident instantly, while you're still in the code, and the tool points at exactly where they are. This is an extremely fast feedback loop that you'll probably want to employ wherever it's possible and makes sense.
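For instance, with type annotations in place, a checker like mypy (which most editors can be wired up to) flags the mismatch below the moment it's typed. The snippet is contrived, and the units bug is deliberate:

```python
def total_cents(amounts: list[int]) -> int:
    return sum(amounts)

# This runs without crashing (the float is silently summed), but a static
# type checker flags the float in a list[int] right in the editor, so the
# dollars-vs-cents mistake never needs a test run, a CI build, or a user
# report to be discovered.
order_total = total_cents([100, 250, 19.99])
print(order_total)
```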

I don’t have a tighter feedback loop than this, but in some cases you can still do better…

Abstraction that simply makes the error impossible

If it’s possible to use tools/abstractions that make the defect impossible, that beats all the feedback loops.

Some examples:

  • Avoid off-by-one errors in for loops by using iterator functions, or functional paradigms that give you filter(), map() and reduce().
  • Avoid SQL injection by using prepared statements.
  • Avoid Cross-site scripting attacks by using an html template language that automatically escapes output.
  • Avoid bugs from unexpected changes in shared objects by using immutable data structures.
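To make the first two bullets concrete, here's a minimal sketch using Python's built-in sqlite3 module (the data and queries are invented for illustration; the same idea applies to any driver that supports parameterized queries):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('ada', 'ada@example.com')")

def find_user(name: str):
    # Parameterized query: the driver handles escaping, so a malicious name
    # like "x' OR '1'='1" can't change the structure of the query.
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

print(find_user("ada"))
print(find_user("x' OR '1'='1"))  # returns [] instead of leaking every row

# Similarly, an off-by-one-prone index loop...
prices = [100, 250, 999]
doubled = []
for i in range(len(prices)):        # easy to get these bounds wrong
    doubled.append(prices[i] * 2)

# ...disappears entirely behind an iteration abstraction:
doubled = [price * 2 for price in prices]   # no index to get wrong
```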

Working through the levels

These levels all form a sort of “onion” of quality feedback loops where the closer you get to the middle, the cheaper the defect is.

Thinking this way, you can easily see how, if your users are reporting an issue caused by a SQL injection attack, you would ideally work to push that problem to tighter and tighter feedback loops where possible. If you can make it show up in logs or alerts, you can fix it before users report it. If you can have testers test for it, you can fix it before users are subjected to it. If you can write some unit tests for it, you can save your testers from having to bother. If you can use the right level of abstraction (prepared statements / parameterized queries in this case), you can eliminate the class of error entirely.

Delivering high-quality software quickly means looking at the most expensive, time-consuming, or frequent classes of errors and systematically pushing them to a lower layer in this onion of quality feedback loops. With a little situational awareness and a little creativity, it's almost always possible, and it leads to huge cost and time savings over the long haul.

This is just one of the many ways that I think the speed vs quality dichotomy in software engineering is a false one.

One pattern that I've seen work well in software development management is one I call the Atomic Team pattern. Basically, it means that the smallest indivisible unit for management should ideally be the team. I think it's a cornerstone of healthy service-oriented architecture, but it's rarely talked about because it's on the more touchy-feely, less technical side.

Management of such a team ends up primarily involving:

  • Communicating expectations and objectives
  • Ensuring the team has what it needs to be successful
  • Handling (ideally) infrequent interpersonal issues that the team can’t handle themselves.

Disadvantages

There are a few disadvantages:

  • Teams can take some time to form.
  • Some developers don’t have the interpersonal skills for this.
  • There are always interpersonal issues to consider.
  • Teams are not resilient to reorganization.
  • Individuals may not be happy with their role on the team or the type of work the team takes on.

Advantages

Requires less management

Let the team self-organize around tasks instead of being command-and-control orchestrated. Drop prioritized work into the team work queue and they can figure out who does what and when based on their own availabilities and skillsets. Team members can hold each other accountable for the team objectives.

Lowered Bus-Factor

Cross-functional teams often have more than one person capable of doing a given task. This keeps the queue moving even when someone is sick or on vacation.

Less Stress for Team Members

On a well-formed team, team members can count on one another for help and support. Vacations and sick days rarely affect a delivery timeline much at all. No one needs to be the only one responsible for stressful issues, outages, deadlines, or un-fun work. The increased autonomy in how the team executes on its objectives is highly motivating.

Better cross-pollination of knowledge and skills

Team members can teach each other skills and organizational knowledge so that they level each other up. My best career development has come from working with brilliant peers.

Better solutions

Put simply, two heads are better than one. The best solutions that I’ve seen in software development are most often devised by a team riffing off each others’ ideas with a whiteboard.

Long-term group ownership of codebases and services

In reality, software doesn’t just get deployed and run in perpetuity without engineering intervention. Even if there aren’t any new features to add, there are hardware issues, bugs, security issues, and software upgrades. When teams own the software they write forever, they write better software, they monitor it better, and they manage it more easily. Without team-ownership, you’re often left looking for someone to work on some long forgotten service when it inevitably needs work.

The result of these advantages is a software development organization that scales much more easily.

Anti-patterns when Managing Teams

Team-Busting

Examples:

  • frequently adding or removing team members
  • loaning out team-members for other non-team objectives
  • holding individuals responsible for results of team objectives
  • creating competition within the team

These types of activities remove the mutual trust and camaraderie from the team and stop the team from forming beyond "just a group of people". This stops the members from working as closely together as they otherwise could and undermines a lot of the advantages of having a team.

Over-Management / Micro-Management

Examples:

  • dictating who on the team does what
  • becoming a communication conduit for two team members when they can just talk to each other

These types of activities are mostly unnecessary except in extreme situations, so they're a time-sink, and they create a bottleneck that makes scaling your organization harder. Additionally, a manager will often make worse decisions in these areas than the team members, because the team members are the people closest to the work.

Not Respecting Communication Overhead

Examples:

  • Choosing teams of similarly skilled individuals (eg “The Frontend Team”). Teams that are not cross-functional often can’t deliver their own solutions without a lot of communication with other teams.
  • Choosing teams that are too large. Teams that are too large spend too much time coordinating efforts and coming to consensus on issues.

As is probably obvious from the title, I think software engineering management is still in a Stone Age. Before I get into my arguments though, I'd like to say that this isn't really about any particular manager or managers that I've had in the past. It's about counter-productive patterns that I've seen and that I think we need to evolve away from. I've also been responsible for some of these mistakes myself as a manager or technical lead over my years in software engineering, so if this comes off as preachy, that's not the intention.

There are a few common problems in engineering management that keep us in the Stone Age though, and I’d like to detail some of them.

The longer you manage, the less technically competent you become.

“Technology is dominated by two types of people: those who understand what they do not manage, and those who manage what they do not understand.” — Putt’s Law

It’s undeniable that knowledge work has some fundamental differences from factory work: there’s never a point where you’ve completely learned how to do your job, and in fact the knowledge required around your job is constantly changing. For this reason, the longer you’re not developing software, the worse you get at it. Being able to type code is only tangentially necessary for software development. The real job is making decisions about how the software should work.

This is one of the few cases in industrialized production where the worker quite regularly knows more about their job than their manager. And if the manager does happen to know more, the gap between them will be constantly narrowing.

The impact of this fact is pretty wide-reaching. It means that a manager is generally not going to be useful for their technical competence, or at least that their usefulness will be waning. In my career I have rarely met exceptions to this rule.

Management is viewed as a promotion and not a distinct role with a separate set of skills.

I’d like to propose instead that management is a distinct role that requires a different set of skills, none of which require the ability to make technical decisions (and I’ll get to those later).

Unfortunately, management in the software development industry is looked at much like management in other industrialized production. Management is widely considered to be "a promotion" that developers should aspire to. It's a common expectation that developers at a certain level of seniority (or age) should be moving into management. There's a very real pressure from the rest of the world too, because it views management as a "higher" position, so you often see developers pushing for this "promotion" even though they haven't acquired the disparate skill set required of a manager. The result is that companies often trade excellent developers for terrible managers, often to the detriment of those developers as well.

Management is viewed as an imperative to command and control.

It’s common to move people into management based on technical ability, because surely we must need a technical person to make the important technical decisions, right? That’s certainly the argument that’s most often made to support the management-as-promotion mindset.

There are huge drawbacks to the manager being “The Decider” in technical matters though (notwithstanding their continually eroding technical ability):

  • Even extremely technical managers are less likely to have better ideas than their entire team.
  • This is not a scalable solution. Even on a small team, the manager will not have time to make all the decisions because aside from typing, all software development is fundamentally about making decisions. As the team grows, the manager will be capable of making an ever decreasing number of decisions. Some managers try to mitigate this by carving out which decisions they will make (the important ones!), but this makes them a process choke-point, and increasingly a productivity net-negative.
  • With decision-making being the developer's primary job, a manager who tries to make the decisions is taking away the most interesting part of the job, and with it the developer's feeling of autonomy, responsibility, accountability, and ownership. The result is that this type of manager is actively and powerfully demotivating their team. Ironically, command-and-control managers seem to be the ones most irritated when their team doesn't self-organize in their absence. In reality it's quite a long and difficult cultural change for a "controlled" team to become a self-organizing team, and I've only ever seen the presence of a command-and-control manager be an insurmountable impediment to it.

It certainly doesn’t have to be this way, but many view management not just as a chance to command and control, but also as a responsibility to command and control. If the team is allowed to make technical decisions together though, on their own, all of these particular problems can just disappear. And I would argue that if a manager doesn’t have a team that can be trusted with this responsibility, it is the manager’s responsibility to transform that team.

Interestingly, it's a pretty normal refrain from engineers that they want their manager to be technical. I believe much of that sentiment comes from an assumption that management is command-and-control by definition, and of course the worst of both worlds would be a non-technical command-and-control manager (there I wholeheartedly agree). I also think that if management would instead trust technical decisions to those closest to the work, engineers would be far less likely to care about their manager's technical competence.

Management focuses on enforcing the predictability of software development rather than on mitigating its undeniable unpredictability.

Software development is development of complex systems that are, by definition, extremely difficult to predict. Typically management tries to predict, control, and reprimand. Given the fact that complex systems are resistant to those types of actions, the effort is largely fruitless and often counter-productive.

Some examples:

  • Deadlines are one such mechanism of control. Despite the fact that engineers are the best people from whom to get accurate estimates, it is extremely common for managers to dictate deadlines for fixed-size projects, and to repeatedly find their team failing to meet them. For some reason, this repeated failure is not taken as an indication that something has gone wrong management-wise. Often the team is chastised, overworked, or reprimanded instead. Of course the real world has dates after which delivering the software is worth far less, but deadlines are simply a thoughtless and reckless way to force either minimizing a feature or lowering the level of quality that goes into it (usually the latter). When a business uses deadlines with a fixed scope of work and quality, it's willfully ignoring the reality of its team's capability.

  • Estimation is an example of attempting prediction. Complex systems are inherently difficult to predict, and we've all seen evidence of this in our own inaccurate estimates. The best success we've had with improving estimation is to limit the amount of work we bundle up in an estimate, which is to say we've improved estimation by estimating less. A project's total timeframe often ends up being greater than the sum of the timeframes of its parts though, because we often fail even to predict all the pieces of work that are necessary.

Estimates from developers, when treated as self-imposed deadlines, are really no better than deadlines, because prediction of how long a software project will take is so incredibly difficult and developers are rarely given the training or time to do better estimates. Estimates like this end up being another mechanism of control, but a somewhat more insidious one, because developers will feel like they have no one to blame but themselves for “missed estimates”.

  • Reprimanding developers for defects or service outages is another often counter-productive mechanism of control. As with many complex systems, there is never a single root cause, and the humans and the technology are so inseparably intertwined in the system that it means nothing to conclude “human failure” and reprimand the human. That’s just not a results-oriented path to improved quality. I’d never suggest that post-mortems are a waste of time, but they are when conducted in a fashion that stops the investigation at “human failure” or even after a single cause. Investigations like this result in the workers feeling terrible, and the process as a whole not actually improving.

All of these management efforts are time-consuming, stressful, and mostly counter-productive (in that they take away from the developer’s time to write software).

Of course you might reasonably ask how the feasibility of a proposed feature can be assessed without some estimation of its costs, which is a fair question. But an effective manager must realize that the estimate (as well as the scope of the work being estimated) must be continually revisited and revised, and that stakeholders need to have their expectations continually managed and adjusted.

The complex system that the developers create includes all the people involved.

Any time something happens in the system, it can't realistically be viewed solely from a technical perspective. The technical aspects exist in large part because of what developers did or did not do to make them happen. The interdependencies between the people and the technology form in such a way that most mental models that separate them are not grounded in reality. There are a few consequences of this:

  • Developers can’t simply be swapped in and out of the team without significant costs. They form relationships and interdependencies with both the technology and the other people on the team that are time-consuming and difficult to reform. This cost can be minimized by having the team regularly push to make itself pan-functional (usually via intra-team mentoring).

  • Conversely, a developer can have real negative impact on this socio-technological system in many ways, the most detrimental of which is by not acting in a way that’s worthy of the team’s trust. If a team member regularly refuses (explicitly or implicitly) to act in accordance with the team’s general wishes, that developer must be removed, regardless of some abstract notion of “technical skill level” or how “indispensable” they’ve made themselves.

  • The team itself requires maintenance. They need to be encouraged and allowed to regularly take the time to look at their process and practices and decide what steps they should take to improve them. They need to be supported in their efforts to take those steps.

  • A high performing engineering team is more valuable than the sum of its highly performing individuals. A team needs to be cultivated carefully with an eye on creating the desired culture of collaboration and trust. This is extremely difficult and takes a great deal of time and effort, but it pays off wildly with much smarter decisions and much faster development.

  • Software development cannot be broken into an assembly line with specification, coding, and testing stages. All of these things occur and reoccur as necessary. They are all “development” and there’s nothing to gain by pretending that the task is divisible.

The complex system that the developers create has a lifetime.

As a piece of software is used more and more over time, the chances of a user finding a way to put the software in an unexpected state increase. It's common for new bugs to arise even in narrowly scoped software that's over 20 years old.

Ultimately the only software that doesn’t require maintenance is software that no one uses anymore. Otherwise, there’s no such thing as completion of a piece of software. At best you can get a convergence toward completion, assuming the project is narrowly scoped and you can resist feature-creep.

Unfortunately management typically looks at software development as a series of projects or features with definite endings where there is an expectation that no more work will be necessary at the completion of the “project”, and that the team will be 100% available for the next project.

This fallacy may or may not lead to completely unrealistic expectations on the first project or two, but as the number of "completed" projects grows over time, the team becomes busier and busier maintaining them and necessarily gets slower and slower at new work.

There are innumerable counter-measures that can be taken to ease the maintenance of software over time of course, but most of these require management to first realize that this reality exists, and to allow the team to spend the time to take these counter-measures.

Management that doesn’t realize this often misses the point of spending time on anything that doesn’t yield value immediately. For example, automated tests are in many scenarios a super valuable way of ensuring regressions don’t creep in. They often don’t provide enough value to offset their immediate up-front costs, but I’ve rarely seen them not be a net positive over time as they eliminate the need for slower manual testing forever after. Short-sighted management will be reluctant to make investments like this, and therefore doom the team to lower productivity over the long haul.

Goals and priorities are rarely clearly and intelligently set.

Goal and priority setting is the absolute number one deliverable that management has for the team. Unfortunately it is common for management to busy itself with other less productive tasks, often involving micro-management, and to actively disrupt the team’s effort to ensure that goals and priorities are clear.

Some common ways that management fails at this:

  • interrupting the team working on the top-priority goal to talk about or pursue a lower-priority goal
  • not ensuring that the team is shielded from lower-priority goals from other parts of the organization
  • failing to ensure that work is consistently presented in priority order
  • failing to ensure that there is some semblance of reasoning behind the order of priority
  • oscillating between priorities so quickly that the team is always starting new things and rarely finishing anything
  • failing to coordinate priorities between teams that have dependencies on one another

The most common failure I've seen here is management's unwillingness to decide which of two top goals is more important. It's fine for management to ask the team for help with that prioritization, or even to pick one at random (if they're really so equal, it shouldn't matter), but what is unacceptable is for management to say "these are our two #1 goals". In that case, the team is forced to take over this management responsibility, because any one person can only do one thing at a time. If the team hasn't learned to manage that responsibility themselves, they will often be terrible at it. Some members will be working on one thing and others on the other, when just a touch of management could have had them coordinating to get one thing done first and delivering value as soon as possible. Instead, the manager in these scenarios has ensured that value will be delivered more slowly than is optimal.

Efforts and Practices are rarely critically examined with any attention to the results.

It’s unfortunately extremely uncommon for managers today to pay much attention to the actual results of the team’s effort and the practices that it follows.

I think that a lot of the reason behind this is that it’s extremely difficult to admit that things aren’t going well when we’ve tied those results to our self-image. The only thing worse than mistakes though are mistakes that go uncorrected. Unfortunately the first step to correcting a mistake is to admit it exists.

Ego can also get in the way and take management off into more interesting (or brag-worthy) endeavours than what’s best for the business. I’ve seen this happen countless times where teams deliver absolutely nothing while working on the latest in tech. If the same ego-driven management style reaches high enough into the organization, and the company is profitable enough to support it, it’s easy for a team to get away with this blunder for years (which I’ve also seen).

Copy-cat or cargo-cult management is probably the next most common excuse for not examining results. Often I've heard "Google does x" or "y is an industry best practice" or even "that's not Agile" without any discussion of whether a particular practice makes sense for the team's own goals, scenario, or needs. Often these managers feel that because they're adhering to "best practices", the actual results will necessarily be optimal and won't need to be examined. I'm definitely a proponent of agile methodologies in general, but saying whether or not something is "Agile" explains nothing about its utility to the organization. There are no sacred practices that should be allowed to escape scrutiny. Any practice that can't be shown to be adding value should be cancelled immediately.

Even when a manager does pay attention to results, it's common for the methods to be extremely flawed. Probably the most infamous example is how managers used to count lines of code to judge the productivity of a developer. Counting the number of commits is ridiculously flawed in the same way: both miss entirely what it means to be productive in an engineering context. The least productive engineers are doing things like building needlessly elaborate architectures and suffering from not-invented-here syndrome when they could just use a far better third-party library. Yes, you do need to be able to determine whether a developer is productive, but that doesn't mean the determination can be quantified. There will be some very important things you need to measure that aren't quantifiable, so managers need to be comfortable with qualitative measurements.

In short, I think it’s pretty common for engineering management to be actively harmful to their team’s speed and their product’s quality. Because uncorrected management mistakes impact entire teams, it’s quite easy for a manager to have an overall net negative contribution to the organization. Of course it doesn’t need to be like this, but a lot of the common expectations of management will really need to change first.

(Thanks to Chris Frank and Sam Decesare for the feedback!)

Bugs and outages happen. If the team can learn from them though, it’s possible to reclaim some of that lost effectiveness, sometimes even to the point that the learning is more valuable than the effectiveness lost.

The best mechanism I've seen to drive that learning is for the team to meet after the outage or defect is no longer an immediate problem and to try to learn from what happened. This meeting is often called the Blameless Post-mortem. These meetings have been the biggest agents of improvement that I've ever seen on an engineering team, so if you think you'll save time by skipping them, you're probably making a huge mistake.

Institute Blamelessness

It’s crucial to assume that everyone involved in the issue would’ve avoided it if possible, given the chance again. It’s natural for people to want to point the finger (sometimes even at themselves) but if the team allows this, it’ll quickly lead to a culture of cover-ups where nothing can be learned because real information about what happened can’t be extracted.

What’s worse than this is that when you decide who’s responsible, you have no reason to continue to investigate what there is to learn. It seems obvious that since that person is to blame, if they’re either removed or coerced to try harder next time, the issue won’t reoccur.

This reminds me of reading about airplane crashes in the news, when the airline concludes the cause was "human error". That was really the best that airline could do? They decided either to ask that human to be better next time, or to replace that human with another human. There's really nothing to learn? No way to improve the aircraft? No way to improve the process? No way to improve the environment? No way to improve the human's understanding of these? This is an airline I'd be afraid to fly with.

Sidney Dekker’s The Field Guide to Understanding Human Error, an absolutely genius book on the subject, sums it up nicely: “Human error is not the conclusion of an investigation. It is the starting point.”

Look for all the Causes

Often these meetings are called RCAs or Root-Cause Analysis meetings, but I try not to call them that anymore, because there rarely seems to be a single root cause. Computer systems are complex systems, and usually multiple things have to go wrong for a defect to appear. John Allspaw explains it more succinctly:

Generally speaking, linear chain-of-events approaches are akin to viewing the past as a line-up of dominoes, and reality with complex systems simply don’t work like that. Looking at an accident this way ignores surrounding circumstances in favor of a cherry-picked list of events, it validates hindsight and outcome bias, and focuses too much on components and not enough on the interconnectedness of components.

Also make sure you’ve validated your assumptions about these causes. There’s a powerful tendency to fall prey to the What-You-Look-For-Is-What-You-Find principle for the sake of expediency and simplicity.

Dig deeper

It’s usually a good idea to dig deeper than you’d expect when determining how a problem occurred. One practice that forces a deeper investigation is The 5 Whys. Basically you just ask “why?” for each successive answer until you’ve gone at least 5 levels deep, attempting to find a precursor for any problem and its precursors. Often in the deeper parts of this conversation, you end up investigating bigger picture problems, like the company’s values, external pressures, and long-standing misconceptions. These deeper problems often require internal or external social conflicts to be resolved, which often makes them tough, but also high-value.

A few caveats though:

  • It can quickly become “The 5 Whos”. You still want to remain blameless.
  • It assumes that there’s a single chain of cause and effect leading to the defect (and there rarely is).

This second caveat is why I don’t really care for The 5 Whys practice anymore. John Allspaw has some great further discussion of that problem here as well.

Decide on Next Actions as a Team

One common yet major mistake I’ve seen is that a manager hears the description of the defect/outage and says “well we’ll just do x then” without soliciting the input of the team. This is usually a huge mistake because the manager rarely has the intimate knowledge of the system that the developers do, and even if he/she did, it is extremely unlikely that any single person can consistently decide a better course of action than a team. Also, the more a manager does this, the less likely the team is to feel that it’s their place to try to solve these problems.

With that said, your team should still be conscious that some of the causal factors are not technical at all. Some examples that I’ve personally seen are:

  • The team is over-worked.
  • The team has been under too much pressure to move quickly or to meet unrealistic deadlines.
  • The team wants more training in certain areas.
  • Management is constantly changing priorities and forcing work to be left partially finished.
  • The team’s workflow is too complicated and merges keep leading to undetected defects.
  • Management won’t allow time to be spent on quality measures.

It’s a common mistake for developers to focus only on the technical problems, probably because they’re the most easily controlled by the development team, but I would say for a team to be truly effective, it must be able to address the non-technical factors as well, and often manage up. Great management will pay very close attention to the team’s conclusions.

Resist “Try Harder Next Time”

Hand-in-hand with blamelessness should almost always be a rule that no improvement should involve “trying harder next time”. That assumes someone didn’t try hard enough the last time, and that only effort needs to change in order for the team to be more effective next time. People will either naturally want to try harder next time, or they won’t. Saying “try harder next time” usually won’t change a thing.

In fact, you’d usually be more successful not just by avoiding solutions that require more human discipline, but by taking it one step further and reducing the level of discipline already required. There’s a great blog post on this by Marco Arment here, and I can tell you, the results in real life are often amazing.

Humans are great at conserving their energy for important things, or things that are likely to cause issues, but the result of that is that unlikely events are often not given much effort at all. This is a common trade-off in all of nature that Erik Hollnagel calls the ETTO (Efficiency Thoroughness Trade-Off) principle. You don’t want your solutions to be fighting an uphill battle against nature.

There’s another kind of strange result of this (I think, anyway) called “the bystander effect”, where often if a problem is the responsibility of multiple people, it’s less likely that any single person will take responsibility for it. This is a real phenomenon and if you’ve worked on a team for any length of time, you’ve seen it happen. You’ll want to try to make sure that whatever solutions you come up with, they won’t fall victim to this bystander effect.

Consider Cost-to-fix vs. Cost-to-endure

It should go without saying that the cost of your solutions for issues should be less than the cost of the issue itself. Sometimes these costs are really hard to estimate given the number of variables and unknowns involved, but it’s at least worth consideration. An early-stage start-up is unlikely to care about being multi-data-center for the sake of redundancy, for instance. It would be ridiculous, on the other hand, for a bank to not seek this level of redundancy.

Consider Process Costs

The second most naive reaction to a bug or outage (after “try harder next time”) is to add more human process to the team’s existing process, like more checks or more oversight. While you may well conclude that this is the best measure for a given issue, keep in mind that it’s probably also the slowest, most expensive, and most error-prone option. These sorts of solutions are often the simplest to conceive, but if they’re your team’s only reaction to issues, they will accumulate over time, dooming the team to move very slowly as it works its way repeatedly through these processes.

Improve Defect Prevention

There are a tonne of possible ways to try to prevent bugs from reaching production that are too numerous to get into here, but there are two really important ways to evaluate them:

(1) Does the method find bugs quickly and early in development?

The shortness of your feedback loop in detecting bugs is hugely important in making sure that your prevention methods don’t slow the team down so much that their cost outweighs their benefit. Manual human testing by a separate testing team is probably the slowest and latest way to find bugs, whereas syntax highlighters may be both the fastest and earliest for the class of issue they can uncover. (These methods test completely different things, of course, but they give an idea of both extremes of feedback loops.)

(2) Does the method give you enough information to fix the problem quickly/easily?

This criterion is important too, though admittedly less so than the previous one. You’ll still want to judge your prevention measures against it, because it’s another factor that can cost you a lot of time and efficiency. Incidentally, manual human testing is probably the worst by this criterion as well, because testing at that level generally just lets you know that something is broken in a particular area. Unit-testing beats integration-testing here too, because unit-testing does a better job of helping you pin-point the issue down to a particular unit (though they don’t actually test for the same types of bugs at all, so it’s a bit of an unfair comparison).

With these two criteria in mind, it’s useful to look at a number of defect prevention measures critically: TDD, unit testing, integration testing, manual testing, beta testing, fuzz-testing, mutation testing, staging environments, dog-fooding, automated screen-shot diffing, static analysis, static typing, linting, pair programming, code reviews, formal proofs, 3rd-party auditing, checklists, etc. I’ve tried to be a bit exhaustive in that list, and while I’ve added some options that have never been useful to me, I’ve probably also forgotten a few. There are new preventative measures popping up all the time too.

An amazing example of preventative measures is probably the extremely popular SQLite project. Their discussion of the measures that they take is fascinating.

Remove Human Processes with Automation

So I’ve hinted at this a few times so far, but it should be reiterated that automating otherwise manual human processes can bring a level of speed and consistency to them that humans can’t compete with. Often this is tricky and not worth it, but those scenarios are getting fewer as technology progresses. There are two huge risks in automation though:

(1) Automation also involves software that can fail. Now you have two systems to maintain and try to keep defect-free.

Often this second system (the automation of the primary system) doesn’t get engineered with the same rigor as the primary system, so it’s possible to automate in a way that is less consistent and more error-prone than a human.

(2) If you’re really going to replace a manual human task, make sure the automation really does outperform the human.

I’ve seen many attempts at automation not actually meet the goal of doing the job as well as a human. It’s not uncommon at all to see examples like teams with 90+% automated test coverage releasing drop-everything defects as awesome as “the customer can’t log in” because some CSS issue makes the login form hidden. A team that sees bugs like this often is almost certainly not ready to remove humans from the testing process, regardless of how many tests they’ve written.

Eliminate Classes of Bugs

When you think about preventing defects without succumbing to “Try Harder Next Time” thought-patterns, one of the most powerful tools is to try to consider how you could make that defect impossible in the future. Often it’s possible to avoid defects by working at levels of better abstraction. Here are a few examples:

  • Avoid off-by-one errors in for loops by using iterator functions, or functional paradigms that give you filter(), map() and reduce().

  • Avoid SQL injection by using prepared statements.

  • Avoid Cross-site scripting attacks by using an html template language that automatically escapes output.

  • Avoid bugs from unexpected changes in shared objects by using immutable data structures.

You may need some creativity here for your own particular defects, but in many cases eliminating the opportunity for a bug to arise is better than trying to catch it when it happens.
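To make the first bullet concrete, here’s a minimal sketch (nothing project-specific is assumed): the hand-rolled loop gives you an index to get wrong, while map() and filter() don’t.

var prices = [10, 20, 30, 40];

// Error-prone: manual index bookkeeping leaves room for off-by-one mistakes
// (e.g. <= where < was intended).
var doubled = [];
for (var i = 0; i < prices.length; i++){
  doubled.push(prices[i] * 2);
}

// The same result with no index to get wrong.
var doubledAgain = prices.map(function(price){
  return price * 2;
});

// filter() similarly removes the need for manual index handling.
var expensive = prices.filter(function(price){
  return price > 15;
});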

For example, I once worked on a team that would occasionally forget to remove console.log() calls from our client-side javascript, which would break the entire site in IE 8. By putting a check for console.log() calls in the build (and breaking it when they exist), we eliminated this class of defect entirely.
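As a minimal sketch of that kind of build check (not the actual script that team used), here’s a small node script that exits non-zero when it finds console.log() calls; the ./client directory is an assumption, so adjust the path for your own project.

// check-console-log.js -- fail the build if any client-side file contains a
// console.log() call.
var fs = require('fs');
var path = require('path');

function findJsFiles(dir){
  return fs.readdirSync(dir).reduce(function(files, name){
    var fullPath = path.join(dir, name);
    if (fs.statSync(fullPath).isDirectory()){
      return files.concat(findJsFiles(fullPath));
    }
    if (/\.js$/.test(name)){
      files.push(fullPath);
    }
    return files;
  }, []);
}

var offenders = findJsFiles('./client').filter(function(file){
  return fs.readFileSync(file, 'utf8').indexOf('console.log(') !== -1;
});

if (offenders.length > 0){
  console.error('console.log() calls found in:\n  ' + offenders.join('\n  '));
  process.exit(1); // a non-zero exit code breaks the build
}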

Go Beyond Prevention

Defect and outage prevention is only one side of the quality coin as well, though it’s usually the side people naturally try to employ when trying to figure out how to handle defects better in the future. You should of course investigate better prevention measures, but you should also consider solutions that will improve your situation when defects do occur, because failures will always happen.

I personally think it’s entirely unrealistic for all your measures to be preventative. A focus entirely on preventative measures has a tendency to slow down your team’s ability to deliver while at the same time not delivering the level of quality that you could.

With that said, here are a few classes of mitigating measures:

Improve your “Time to Repair”

There’s an interesting metric called MTTR, which stands for “Mean Time to Repair/Recovery” and is basically the average time it takes you to fix a defect/outage. It’s an important metric, because the cost of a defect must include how long that defect was affecting customers. The speed at which you can deliver a fix is going to be a major factor in how well you mitigate defects. You’ll want to ask questions like:

  • How can we pinpoint problems faster?
  • How can we create fixes faster?
  • How can we verify our fixes faster?
  • How can we deliver fixes faster?

Practices like Continuous Delivery can help here greatly. If you have a 20 minute manual deployment that involves a number of coordinated activities from a number of team members, you will be leaving customers exposed for much longer than a team practicing Continuous Delivery.

Automated testing on its own can be a huge help. If the bulk of your tests are manual, then a fix will take some time to verify (including verifying that it doesn’t break anything else). Teams that rely heavily on manual testing will usually test much less thoroughly on a “hot fix”, which can occasionally make the situation worse.

In my experience though, nothing affects MTTR as much as the speed at which you can detect defects/outages…

Improve Detection Time

Talk about how quickly you discovered the issue compared to when it was likely to have started. If your customers are discovering your issues, try figuring out if there’s a way that you can beat them to it. Instrumentation (metrics & logging) has been a huge help for me in different organizations for knowing about problems before customers can report them. Information radiators can help keep those metrics ever-present and always on the minds of the team members.

Threshold-based alerting systems (that proactively reach out to the team to tell them about issues) in particular are valuable because they don’t rely on the team to check metrics themselves, and they can short circuit that “human polling loop” and alert the team much faster, or during times that they ordinarily would not be looking (at night, on weekends, etc). It’s pretty easy to see that an alerting system that alerts about an outage on a Friday night can save the customer days of exposure.
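Here’s a minimal sketch of what such a threshold check might look like; the metrics read and the paging call below are placeholders for whatever metrics store and alerting integration you actually use.

var ERROR_RATE_THRESHOLD = 0.05; // alert when more than 5% of requests fail
var CHECK_INTERVAL = 60 * 1000;  // check once a minute

// Placeholder: query your real metrics system here.
function readErrorRate(callback){
  callback(null, 0.02);
}

// Placeholder: page whoever is on call (PagerDuty, Slack, SMS, etc).
function pageOnCall(message){
  console.error('ALERT:', message);
}

setInterval(function(){
  readErrorRate(function(err, rate){
    if (err) return console.error('could not read error rate', err, err.stack);
    if (rate > ERROR_RATE_THRESHOLD){
      pageOnCall('error rate ' + rate + ' exceeds threshold ' + ERROR_RATE_THRESHOLD);
    }
  });
}, CHECK_INTERVAL);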

Lessen the Impact of Failures

If you can figure out ways for failures to have less impact, that’s a huge win as well. Here are a few examples of ideas I’ve seen come out of one of these meetings:

  • Have deployments go to only a 5% segment of users first for monitoring before going out to 100% of the users.
  • Speed up builds and deployments, so hot-fixes can go out faster.
  • Have an easy way to display a message to users during an outage.
  • Improve metrics and logging to speed-up debugging around that particular issue.
  • Set-up off-hours alerting.
  • Have a one-click “revert deployment” mechanism that can instantly revert to a previous deployment in case something goes wrong.
  • Create “bulkheads/partitions” in your applications so that if one part fails, the rest can still function properly. There are many common examples of this in software, including PHP’s request partitioning model, or the browser’s ability to continue despite a javascript exception, even on the same page. Service-oriented architectures often have this quality as well.

You may or may not need some creativity here to come up with your own, but it’s worth the effort.

Be Realistic with Plans For Improvement

Whatever you say you will do as a result of this meeting, make sure that it’s actually realistic and that there’s a realistic plan to get it into the team’s future work (e.g. who will do it? when?). The best way to have completely useless meetings is to not actually do what you planned to do.

Write Up a Report for the Rest of the Company

The report should say honestly how bad the problem was, in what way (and for how long) customers were affected, and generally what events led up to it (blamelessly!). Additionally, you’ll want to declare the next steps that the team plans to take, so that the company knows you’re a professional engineering team that cares about results as much as the other people in the organization. You should be ready and willing to take questions and comments on this report, and send it to as many interested parties as possible. Often other people in the company will have additional ideas or information, and the transparency makes them feel like these are welcome any time.

The real value in this is that you show that the entire team is uniformly holding itself accountable for the problem, and that any propensity that the rest of organization has for blaming a single person is not in accordance with the engineering team’s views. The engineering team and management should be willing and able to defend any individuals targeted for blame.

Decide What Types of Defects/Outages Necessitate These Meetings

Some organizations are more meeting-tolerant than others, so there’s no hard-and-fast rule here. If you held one of these meetings for every production defect, though, you’d probably very quickly have a bunch of solutions in place that greatly reduce the number of defects (and therefore reduce the number of these meetings!). These meetings are all investments. The more you have, the more they start to pay off, both in quality of product and in speed of delivery (if you stay conscious of that!).

One thing I will recommend though is that you look for recurrences and patterns in these defects/outages. The team will usually benefit disproportionately from more time invested in solving repeated problems.

Last week I had the pleasure of being on a panel about Continuous Testing put on by Electric Cloud. There’s video of the discussion if you’re interested, here.

Additionally, I posted a blog post on the subject on the ClassDojo Engineering blog here.

For posterity, I’ll x-post here as well:

Continuous Testing at ClassDojo

We have thousands of tests and regularly deploy to production multiple times per day. This article is about all the crazy things we do to make that possible.

How we test our API

Our web API runs on node.js, but a lot of what we do should be applicable to other platforms.

On our API alone, we have ~2000 tests.

We have some wacky automated testing strategies:

  • We actually integration test against a database.

  • We actually do api tests via http, bringing up a new server for each test and tearing it down after (see the sketch just after this list).

  • We mock extremely minimally, because there’s little to gain performance-wise in our case, and we want to ensure that integrations work.

  • When we do more unit-type tests, it’s for the sake of convenience of testing a complicated detail of some component, and not for performance.
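Here’s a minimal sketch of that bring-up-a-server-per-test pattern. It’s not our actual test code (our helpers are more involved), and it assumes mocha plus node’s core http module:

var assert = require('assert');
var http = require('http');

describe('GET /status', function(){
  var server;
  var port;

  beforeEach(function(done){
    // In the real suite this would be the actual api server.
    server = http.createServer(function(req, res){
      res.writeHead(200, {'Content-Type': 'application/json'});
      res.end(JSON.stringify({ok: true}));
    });
    server.listen(0, function(){ // 0 means "pick any free port"
      port = server.address().port;
      done();
    });
  });

  afterEach(function(done){
    server.close(done);
  });

  it('responds with a 200 over real http', function(done){
    http.get('http://localhost:' + port + '/status', function(res){
      assert.equal(res.statusCode, 200);
      res.resume(); // drain the response so the connection can close
      done();
    });
  });
});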

All ~2000 tests run in under 3 minutes.

It’s also common for us to run the entire suite dozens of times per day.

With that said, I usually do TDD when a defect is found, because I want to make sure that the missing test actually exposes the bug before I kill the bug. It’s much easier to do TDD at that point when the structure of the code is pretty unlikely to need significant changes.

We aim for ~100% coverage: We find holes in test coverage with the istanbul code coverage tool, and we try to close those holes. We’ve got higher than 90% coverage across code written in the last 1.5 years. Ideally we go for 100% coverage, but practically we fall a bit short of that. There are some very edge-case error-scenarios that we don’t bother testing because testing them is extremely time-consuming, and they’re very unlikely to occur. This is a trade-off that we talked about as a team and decided to accept.

We work at higher levels of abstraction: We keep our tests as DRY as possible by extracting a bunch of test helpers that we write along the way, including our api testing library verity.

Speeding up Builds

It’s critical that builds are really fast, because they’re the bottleneck in the feedback cycle of our development and we frequently run a dozen builds per day. Faster builds mean faster development.

We use a linter: We use jshint before testing, because node.js is a dynamic language. This finds common basic errors quickly before we bother with running the test suite.

We cache package repository contents locally: Loading packages for a build from a remote package repository was identified as a bottleneck. You could set up your own npm repository, or just use a cache like npm_lazy. We just use npm_lazy because it’s simple and works. We recently created weezer to cache compiled packages better as well for an ultra-fast npm install when dependencies have not changed.

We measured the real bottlenecks in our tests:

  1. Loading fixtures into the database is our primary bottleneck, so we load fixtures only once for all read-only tests of a given class (there’s a sketch of this after the list). It’s probably not a huge surprise that this is our bottleneck, given that we don’t mock out database interaction, but we consider it acceptable because the tests still run extremely fast.

  2. We used to use one of Amazon EC2’s C3 compute cluster instances for jenkins, but switched to a much beefier dedicated hosting server when my laptop could inexplicably run builds consistently faster than EC2 by 5 minutes.
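Here’s a minimal sketch of that load-once approach for read-only tests, again assuming mocha; the loadFixtures() helper below is a placeholder, not our real fixture code.

var assert = require('assert');

// Placeholder: in the real suite this truncates the test database and
// inserts a known data set, calling back with a handle to query it.
function loadFixtures(callback){
  var db = {
    findUserById: function(id, cb){
      cb(null, {id: id, name: 'Test User'});
    }
  };
  setImmediate(function(){ callback(null, db); });
}

describe('user queries (read-only)', function(){
  var db;

  // before(), not beforeEach(): fixtures are loaded once for the whole block
  // because none of these tests write to the database.
  before(function(done){
    loadFixtures(function(err, loadedDb){
      if (err) return done(err);
      db = loadedDb;
      done();
    });
  });

  it('finds a user by id', function(done){
    db.findUserById('user1', function(err, user){
      if (err) return done(err);
      assert.equal(user.name, 'Test User');
      done();
    });
  });
});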

We fail fast: We run full integration tests first so that the build fails faster. Developers can then run the entire suite locally if they want to.

We notify fast: Faster notification of failures/defects leads to diminished impact. We want developers to be notified ASAP of a failure so:

  • We bail on the build at the first error, rather than running the entire build and reporting all errors.
  • We have large video displays showing the test failure.
  • We have a gong sound effect when it happens.

We don’t parallelize tests: We experimented with test suite parallelization, but it’s really complicated on local machines to stop contention on shared resources (the database, network ports) because our tests use the database and network. After getting that to work, we found very little performance improvement, so we just reverted it for simplicity.

Testing Continues after Deployment

We run post-deployment tests: We run basic smoke tests post-deployment as well to verify that the deployment went as expected. Some are manually executed by humans via a service called Rainforest. Any of those can and should be automated in the future though.

We use logging/metrics/alerts: We consider logging and metrics to be part of on-going verification of the production system, so it should also be considered as part of our “test” infrastructure.

Our Testing Process

We do not have a testing department or role. I personally feel that that role is detrimental to developer ownership of testing, and makes continuous deployment extremely difficult/impossible because of how long manual testing takes. I also personally feel that QA personnel are much better utilized elsewhere (doing actual Quality Assurance, and not testing).

Continuous Testing requires Continuous Integration. Continuous Integration doesn’t work with Feature Branches. Now we all work on master (more on that here!).

We no longer use our staging environment. With the way that we test via http, integrated with a database, there isn’t really anything else that a staging server is useful for, testing-wise. The api is deployed straight to a small percentage (~5%) of production users once it is built. We can continue testing there, or just watch logs/metrics and decide whether to deploy to the remaining ~95% or to revert, both with just the click of a button.

We deploy our frontends against the production api in non-user facing environments, so they can be manually tested against the most currently deployed api.

When we find a defect, we hold blameless post-mortems. We look for ways to eliminate that entire class of defect without taking measures like “more manual testing”, “more process”, or “trying harder”. Ideally solutions are automatable, including more automated tests. When trying to write a test for a defect we try as hard as possible to write the test first, so that we can be sure that the test correctly exposes the defect. In that way, we test the test.

Lessons learned:

  • Our testing practices might not be the best practices for another team. Teams should decide together what their own practices should be, and continually refine them based on real-world results.

  • Find your actual bottleneck and concentrate on that instead of just doing what everyone else does.

  • Hardware solutions are probably cheaper than software solutions.

  • Humans are inconsistent and less comprehensive than automated tests, but more importantly they’re too slow to make CD possible.

Remaining gaps:

  • We’re not that great at client/ui testing (mobile and web), mostly due to lack of effort in that area, but we also don’t have a non-human method of validating styling etc. We’ll be investigating ways to automate screenshots and to look for diffs, so a human can validate styling changes at a glance. We need to continue to press for frontend coverage either way.
  • We don’t test our operational/deployment code well enough. There’s no reason that it can’t be tested as well.
  • We have a really hard time with thresholds, anomaly detection, and alerts on production metrics, so we basically have to watch a metrics display all day to see if things are going wrong.

The vast majority of our defects/outages come from these three gaps.

It’s been over a year since I’ve been on a team that does Sprints as part of its engineering process, and I’ve really changed my mind on their usefulness. I’m probably pretty old-school “agile” by most standards, so I still think Scrum is better than what most teams out there are doing, but I think it’s lost me as a proponent for a few reasons, with the most surprising reason (to me) being its use of sprints.

The sprint, for those few engineers that might not have ever developed this way, is simply a time box where you plan your work at the beginning, usually estimating what work will fit within that time box, and then do some sort of review afterward to assess its efficacy, including the work completed and the accuracy of the estimates.

I’ve been involved in sprints as short as a week and as long as a month, depending on the organization. The short timespan can have an amazingly transformative effect on organizations that are releasing software less often than that: it gets them meeting regularly to re-plan, it gets them improving their process regularly, and it gets a weekly stream of software released so that the rest of the company can get a sense of what is happening and at what pace. These are huge wins: You get regular course-correction, improvement and transparency, all helping to deliver more working software, more reliably. What’s not to love?

Well I’ve always ignored a couple of major complaints about Sprints because I’ve felt that those complaints generally misunderstood the concept of the time box. The first complaint is that a Sprint is just another deadline, and one that looms at a more frequent pace so it just adds more pressure. If your scrummaster or manager treats them like weekly deadlines, he/she is just plain “doing it wrong”. Time boxes are just meant to be a regular re-visiting and re-assessment of what’s currently going on. It’s supposed to be a realistic, sustainable pace so that when observed, the speed that the team is delivering at can start to be relied upon for future projections. Treating it like a deadline has the opposite effect:

  • Developers work at unsustainable paces, inherently meaning that at some point in the future, they will fail to maintain that pace and generally will not be as predictable to the business as they otherwise could be.
  • Developers cut corners, leading to defects, which lead to future interruptions, which leads to the team slowing more and more over time. I’ve been on a team that pretty much only had time for its own defects, so I’ve seen this taken to the absolute extreme.

Here’s the thing though, if it’s supposed to encourage a reliable, sustainable pace, why would you ever call it a “sprint”? “What’s in a name?”, right? Well it turns out that when you want to get people to change the way they work, and you want them to understand the completely foreign concepts you’re bringing to them, it’s absolutely crucial that you name the thing in a way that also explains what it is not.

In Scrum, it’s also common to have a “sprint commitment” where the team “commits” to a body of work to accomplish in that time frame. The commitment is meant to be a rough estimate for the sake of planning purposes, and if a team doesn’t get that work done in that time, it tries to learn from the estimate and be more realistic in the next sprint. Developers are not supposed to be chastised for not meeting the sprint commitment — it’s just an extra piece of information to improve upon and to use for future planning. Obviously naming is hugely important here too, because in every other use of the word, a “commitment” is a pledge or a binding agreement, and this misnomer really influences the way people (mis)understand the concept of sprints. Let’s face it: if people see sprints as just more frequent deadlines (including those implementing them), the fault can’t be entirely theirs.

Sooo… The naming is terrible, but the concept is a good one, right?

Well “iteration” is definitely a much better name, and I hope people use that name more and more.

More importantly though, I’d like to argue that time boxes for planning and delivery are fundamentally flawed.

I’ve personally found that sprint commitments are entirely counter-productive: A team can just gauge its speed for planning purposes based on past performance instead, which is really what the sprint commitment is meant to be. We should be forecasting entirely based on past performance, and adjusting constantly rather than trying to predict, and trying to hold people to predictions.

Also, with planning happening at the beginning of the sprint and software delivery happening at the end, in a regular cadence, the team is much less likely to do things more frequently than this schedule.

Instead of planning and releasing on the schedule of a sprint, we’ve had a lot more success practicing just-in-time planning and continuous delivery.

Just-In-Time Planning

Planning should happen on-demand, not on a schedule. Planning on a schedule often has you planning for things too far ahead, forcing people to try to remember the plan later. I’ve seen an hour’s worth of planning fall apart as soon as we tried to execute the plan, so I really don’t see the point of planning for particular time-spans (especially larger ones). Teams should simply plan for the work they’re currently starting, the natural way, not based on time, but based on the work they have to do. They should plan and re-plan a particular task as frequently as is necessary to deliver it, and they should plan for as long or as short as necessary to come up with a plan that everyone is happy with. Planning naturally like this leads to more frequent situations where developers will say “hold on, we should sit down and talk about this before we continue the current plan”, which is good for everybody involved.

Continuous Delivery

The cadence of delivering software at the end of a sprint almost always means an organization does not deploy the software more often than that cadence. I’ve worked on teams that tried to make “deployment day” somewhere in the middle of the sprint to emphasize that the sprint boundary is just an arbitrary time box, decoupled from deployment, but we never managed to deploy more often than once per sprint, and it really only resulted in a de facto change in the sprint schedule for our QA personnel. The very existence of a time-box puts people in the mindset that releasing software, deploying software and calling software “done” are all the same event and there’s a scheduled time for that. Without the time-box, people start to think more freely about ways to decouple these events and how to safely improve the customer experience more often. Today, free from sprints, I’m safely deploying to production multiple times per day.

Now Scrum proponents will argue that Sprints can be exactly like this — sprints do not preclude JIT planning or continuous delivery — and the time-box just adds additional points of planning and review to ensure that the team is doing the necessary planning and reviewing. While I wholeheartedly agree with this on a theoretical level, this is not what ends up happening in practice: In reality, when you schedule time for particular aspects of work, people tend to wait until that schedule to do that type of work.

And I realize that what I’m suggesting sounds a bit more sloppy, and lacks the formality of a scheduled cadence, but it simply ends up being more natural and more efficient. These concepts aren’t new either — any teams that have been using the venerable kanban method will be familiar with them.

One additional Caveat

With all this said, if a team is going to continually improve, I haven’t seen a better mechanism than the regular retrospective. I just don’t think Sprints are necessary to make that happen.

This is the last part of my 3-part series on production-quality node.js web applications. I’ve decided to leave error prevention to the end because, while it’s super-important, it’s often the only strategy developers employ to ensure customers have as defect-free an experience as possible. It’s definitely worth checking out parts I and II if you’re curious about other strategies and really interested in getting significant results.

Error Prevention

With that said, let’s talk about a few error prevention tricks and tactics that I’ve found extremely valuable. There are a bunch of code conventions and practices that can help immensely, and here’s where I’ll get into those. I should add that I’m not going to talk about node-fibers, harmony-generators, or streamline.js, which each offer their own solutions to some of these problems – ultimately because I haven’t used them (none of them are considered very ‘standard’ yet). There are people using them to solve a few of the issues I’ll talk about, though.

High test coverage, extensive testing

There should be no surprise here, but I think test coverage is absolutely essential in any dynamically typed app, because so many errors can only happen at runtime. If you haven’t got heavy test coverage (even measuring coverage on its own is valuable!), you will run into all kinds of problems as the codebase gets larger and more complex, and you’ll be too scared to make sweeping changes (like changing out your database abstraction, or how sessions are handled). An untested/uncovered codebase will have you constantly shipping codepaths to production that have never actually been executed anywhere before.

The simplest way I’ve found to get code coverage stats (and I’ve used multiple methods in the past) is to use istanbul. Here’s istanbul’s output on a file from one of my npm libraries:

[Screenshot: istanbul’s coverage report for one file, with untested lines highlighted in red]

The places marked in red are the places that my tests are not testing at all. I use this tool all the time to see what I might have missed testing-wise, when I haven’t been doing TDD, and then I try to close up the gaps.

Use a consistent error passing interface and strategy

Asynchronous code should always take a callback that expects an Error object (or null) in the first parameter (the standard node.js convention). This means:

  • Never create an asynchronous function that doesn’t take a callback, no matter how little you care about the result.

  • Never create an asynchronous function that doesn’t take an error parameter as the first parameter in its callback, even if you don’t have a reason for it immediately.

  • Always use an Error object for errors, and not a string, because they have very handy stacktraces.

Following these rules will make things much easier to debug later when things go wrong in production.

Of course there are some cases where you can get away with breaking these rules, but even in those cases, over time, as you use these functions more often and change their internal workings, you’ll often be glad that you followed these rules.
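Here’s a minimal sketch of an asynchronous function that follows these rules; the file name and function are made up for illustration.

var fs = require('fs');

// Always take a callback, and always pass an Error object (or null) first.
function readUserCount(filePath, callback){
  fs.readFile(filePath, 'utf8', function(err, contents){
    if (err){
      return callback(err); // pass the original Error object up
    }
    var count = parseInt(contents, 10);
    if (isNaN(count)){
      // Use an Error (not a string) so callers get a stacktrace.
      return callback(new Error('user count file did not contain a number'));
    }
    return callback(null, count);
  });
}

// The caller always checks the error parameter first.
readUserCount('./user-count.txt', function(err, count){
  if (err){
    return console.error('could not read user count', err, err.stack);
  }
  console.log('user count:', count);
});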

Use a consistent error-checking and handling strategy.

Always check the errors from a callback. If you really don’t care about an error (which should probably be rare), do something like this to make it obvious:

getUser(userId, function(err, user){
  if (err){
    // do nothing.  we didn't need a user that badly
  }
});

Even then, it’s probably best to log the error at the very least. It’s extremely rare that you’d bother writing code of which you don’t care about the success/failure.

In many cases you won’t be expecting an error, but you’ll want to know about it. In those cases, you can do something like this to at least get it into your logs:

getUser(userId, function(err, user){
  if (err){
    console.error('unexpected error', err, err.stack);
    // TODO actually handle the error though!
  }
});

You’re going to want to actually handle the unexpected error there too of course.

If you’re in some nested callback, you can just pass it up to a higher callback for that callback to handle:

getUser(userId, function(err, user){
  if (err){
    return cb(err);
  }
});

NB: I almost always use return when calling callbacks to end execution of the current function right there and to save myself from having to manage a bunch of if..else scenarios. Early returns are just so much simpler in general that I rarely find myself using else anywhere anymore. This also saves you from becoming another “can’t set headers after they are sent” casualty.

If you’re at the top of a web request and you’re dealing with an error that you didn’t expect, you should send a 500 status code, because that’s the appropriate code for errors that you didn’t write appropriate handling for.

getUser(userId, function(err, user){
  if (err){
    console.error('unexpected error', err, err.stack);
    res.writeHead(500);
    res.end("Internal Server Error");
  }
});

In general, you should always be handling or passing up every single error that could possibly occur. This will make surfacing a particular error much easier. Heavy production use has an uncanny ability to find itself in pretty much any codepath you can write.

Watch out for error events.

Any EventEmitters in your code (including streams) that emit an error event absolutely need to have a listener for those error events. Uncaught error events will bring down your entire process, and usually end up being the main cause of process crashes in projects that I’ve worked on.

The next question will be how you handle those errors, and you’ll have to decide on a case-by-case basis. Usually you’ll at least want to 500 on those as well, and probably log the issue.
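For example, here’s a minimal sketch of handling a stream’s error event rather than letting it crash the process (the file path is made up):

var fs = require('fs');

var stream = fs.createReadStream('./maybe-missing-file.txt');

// Without this listener, a missing file would emit an unhandled 'error'
// event and bring down the whole process.
stream.on('error', function(err){
  // Handle it case-by-case: here we just log it; in a request handler you'd
  // probably also respond with a 500.
  console.error('read stream failed', err, err.stack);
});

stream.on('data', function(chunk){
  process.stdout.write(chunk);
});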

Managing Errors at Runtime

Domains

Domains are without a doubt your best tool for catching errors in runtime that you missed at development time. There are three places that I’ve used them to great effect:

  1. To wrap the main process. When it dies, a domain catches the error that caused it.
  2. To wrap cluster’s child processes. If you use cluster-master like I do, or if you use cluster.setupMaster() with a different file specified via exec for child processes, you’ll want the contents of that file to be wrapped in a domain as well. This means that when a child process has an uncaught error, this domain catches it.
  3. To wrap the request and response objects of each http request. This makes it much more rare that any particular request will take down your process. I just use node-domain-middleware to do this for me (This seems like it should work in any connect-compatible server as well, despite the name).

In the case of catching an error with a domain on the main process, or a child process, you should usually just log the issue (and bump a restart metric), and let the process restart (via a process manager for the main process, or via cluster for the child process — see part I for details about this). You probably do not want the process to carry on, because if you knew enough about this type of error to catch it, you should have caught it before it bubbled up to the process level. Since this is an unexpected error, it could have unexpected consequences on the state of your application. It’s much cleaner and safer to just let the process restart.
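Here’s a minimal sketch of the first case, wrapping the main process in a domain; the './server' entry point and the metrics call are hypothetical stand-ins.

var domain = require('domain');

var mainDomain = domain.create();

mainDomain.on('error', function(err){
  console.error('uncaught error at the process level', err, err.stack);
  // metrics.increment('processRestart'); // hypothetical metrics client
  process.exit(1); // let the process manager (or cluster) restart us
});

mainDomain.run(function(){
  // All of the application's startup code goes in here, so errors that
  // escape it end up at the domain's 'error' handler above.
  require('./server'); // hypothetical entry point for the actual app
});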

If you’ve caught an error from the request or response object, it’s probably safe and appropriate to send a 500 status code, log the error (and stack trace), and bump a 500 metric.

Also, I’ve written a module that helps me test that my domains are correctly in place: error-monkey. You can temporarily insert error-monkey objects in different places in your codebase to ensure that domains are catching them correctly. This way, you’re not tweaking the location of your domains and waiting for errors to happen in production to verify them.

Gracefully restarting

Now that domains have given you a place to catch otherwise uncaught exceptions and to hook in your own restart logic, you’ll want to think about gracefully restarting rather than just killing the process abruptly and dropping all requests in progress. Instead, you’ll want to restart only after all current requests have completed (see: http://nodejs.org/api/http.html#http_server_close_callback). This way, the process can be restarted more gracefully.

If you’re running a bunch of node processes (usually with cluster), and a number of servers (usually behind a load-balancer), it shouldn’t be that catastrophic to close one instance temporarily, as long as it’s done in a graceful manner.

Additionally, you may want to force a restart in the case that the graceful shutdown is taking an inordinate amount of time (probably due to hanging connections). The right amount of time probably has a lot to do with what types of requests you’re fulfilling, and you’ll want to think about what makes sense for your application.
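Here’s a minimal sketch of a graceful restart with a forced fallback; the 30 second grace period is just an example value.

var GRACE_PERIOD = 30 * 1000; // arbitrary example; tune it for your app

// e.g. call this from your domain's 'error' handler with your http server.
function gracefulShutdown(server){
  // Stop accepting new connections; the callback fires once all current
  // requests have completed.
  server.close(function(){
    process.exit(1); // let the process manager restart us
  });

  // Force the exit if hanging connections keep the close from finishing.
  var forceTimer = setTimeout(function(){
    console.error('graceful shutdown timed out; forcing exit');
    process.exit(1);
  }, GRACE_PERIOD);

  // Don't let this timer keep an otherwise-finished process alive.
  forceTimer.unref();
}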

With that said…

These are just my notes on what I’ve tried so far that achieved worthy results. I’d be interested to hear if anyone else has tried any other approaches for preventing, mitigating, and detecting defects.

This is the second part of a three part series on making node.js web apps that are (what I would consider to be) production quality. The first part really just covered the basics, so if you’re missing some aspect covered in the basics, you’re likely going to have some issues with the approaches discussed here.

First, I’m going to talk about a few major classes of defects and how to detect them.

I could go straight to talking about preventing/avoiding them, but detection is really the first step. You’re simply not going to get a low-defect rate by just using prevention techniques without also using detection techniques. Try to avoid the junior engineer mindset of assuming that “being really careful” is enough to achieve a low defect rate. You can get a much lower defect rate and move much faster by putting systems in place where your running application is actively telling you about defects that it encounters. This really is a case where working smart pays much higher dividends than working hard.

The types of defects you’re most likely to encounter

I’m a huge fan of constantly measuring the quality aspects that are measurable. You’re not going to detect all possible bugs this way, but you will find a lot of them, and detection this way is a lot more comprehensive than manual testing and a lot faster than waiting for customers to complain.

There are (at least) 4 types of error scenarios that you’re going to want to eliminate:

  • restarts
  • time-outs
  • 500s
  • logged errors

Restarts

Assuming you’ve got a service manager to restart your application when it fails and a cluster manager to restart a child process when one fails (like I talked about in Part I), one of the worst types of error scenarios that you’re likely to encounter will be when a child process dies and has to restart. It’s a fairly common mistake to believe that once you have automatic restarting working, process failure is no longer a problem. That’s really only the beginning of a solution to the problem though. Here’s why:

  • Node.js is built to solve the C10K problem, so you should expect to have a high number of requests per process in progress at any given time.

  • The node process itself handles your web-serving, and there’s no built-in request isolation like you might find on other platforms such as PHP, where one request can’t have any direct effect on another request. With a node.js web application, any uncaught exception or unhandled error event will bring down the entire server, including your thousands of connections in progress.

  • Node.js’s asynchronous nature and multiple ways to express errors (exceptions, error events, and callback parameters) make it difficult to anticipate or catch errors more holistically as your codebase gets larger, more complex, and has more external or 3rd-party dependencies.

These three factors combine to make restarts both deadly and difficult to avoid unless you’re very careful.

Detection: The easiest way to detect these is to put metrics and logging at the earliest place in both your cluster child code and your cluster master code to tell you when the master or child processes start. If you want to remove the noise caused by new servers starting up, or normal deployments, then you may want to write something a little more complex that can specifically detect abnormal restarts (I’ve got a tiny utility for that called restart-o-meter too).

You might have an aggregated logging solution that can send you alerts based on substrings it finds in the logs too, or that can feed directly into a time-series metrics system.
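A minimal sketch of that kind of start-up instrumentation (the commented-out metrics call is a hypothetical stand-in for your metrics client):

// Put this at the very top of both the cluster master and the cluster child
// entry points, so every (re)start shows up in the logs and on a graph.
var cluster = require('cluster');

var role = cluster.isMaster ? 'master' : 'child';
console.log('process start:', role, 'pid', process.pid, new Date().toISOString());
// metrics.increment('processStart.' + role); // hypothetical metrics client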

Time-outs

Time-outs are another type of failed request, where the server didn’t respond within some threshold that you define. This is pretty common if you forget to call res.end(), or a response just takes too long.

Detection: You can just write a quick and dirty middleware like this to detect them:

app.use(function(req, res, next){
  var timeoutThreshold = 10000; // 10 seconds
  res.timeoutCheck = setTimeout(function(){
    metrics.increment("slowRequest");
    console.error("slow request: ", req.method.toUpperCase(), req.url);
  }, timeoutThreshold);
  res.on('finish', function(evt){
      clearTimeout(res.timeoutCheck);
  });
  next();
});

(or you can grab this middleware I wrote that does basically the same thing).

500s

On a web application, one of the first things I want to ensure is that we’re using proper status codes. This isn’t just HTTP nerd talk (though no one will say I’m innocent of that), but rather a great system of easily categorizing the nature of your traffic going through your site. I’ve written about the common anti-patterns here, but the gist of it is:

Respond with 500-level errors if the problem was a bug in your app, and not in the client.

This lets you know that there was a problem, and it’s your fault. Most likely you only ever need to know 500 Internal Server Error to accomplish this.

Respond with 400-level errors if the problem was a bug in the client.

This lets you know that there was a problem and it was the client app’s fault. If you learn these 4 error codes, you’ll have 90% of the cases covered:

  • 400 Bad Request — When it’s a client error, and you don’t have a better code than this, use this. You can always give more detail in the response body.
  • 401 Not Authenticated — When they need to be logged in, but aren’t.
  • 403 Forbidden — When they’re not allowed to do something.
  • 404 Not Found — When a url doesn’t point to an existing thing.

If you don’t use these status codes properly, you won’t be able to distinguish between:

  • successes and errors
  • errors that are the server’s responsibility and errors that are the client’s responsibility.

These distinctions are dead-simple to make and massively important for determining whether errors are occurring, and if so, where to start debugging.

If you’re in the context of a request and you know to expect exceptions or error events from a specific piece of code, but don’t know/care exactly what type of exception, you’re probably going to want to log it with console.error() and respond with a 500, indicating to the user that there’s a problem with your service. Here are a couple of common scenarios:

  • the database is down, and this request needs it
  • some unexpected error is caught
  • some api for sending email couldn’t connect to the smtp service
  • etc, etc

These are all legitimate 500 scenarios that tell the user “hey the problem is on our end, and there’s no problem with your request. You may be able to retry the exact same request later”. A number of the “unexpected errors” that you might catch though will indicate that your user actually did send some sort of incorrect request. In that case, you want to respond with a reasonable 4xx error instead (often just a 400 for “Bad Request”) that tells them what they did wrong.

Either way, you generally don’t want 500s at all. I get rid of them by fixing the issues that caused them, or turning them into 4xxs where appropriate (eg. when bad user input is causing the 500), to tell the client that the problem is on their end. The only times that I don’t try to change a response from a 500 is when it really is some kind of internal server error (like the database is down), and not some programming error on my part or a bad client request.

Detection: 500s are important enough that you’re going to want to have a unified solution for them. There should be some simple function that you can call in all cases where you plan to respond with a 500 that will log your 500, the error that caused it, and the .stack property of that error, as well as incrementing a metric, like so:

var internalServerError = function(err, req, res, body){
    console.log("500 ", req.method, req.url);
    console.error("Internal Server Error", err, err.stack);
    metrics.increment("internalServerError");
    res.statusCode = 500;
    res.end(body);
};

Having a unified function you can call like this (possibly monkey-patched onto the response object by a middleware, for convenience) gives you the ability to change how 500s are logged and tracked everywhere, which is good, because you’ll probably want to tweak it fairly often.

You’re probably going to use this on most asynchronous functions in your HTTP handling code (controllers?) where you don’t expect an error. Here’s an example:

function(req, res){
    db.user.findById(someId, function(err, user){
        if (err) return internalServerError(err, req, res, "Internal Server Error");
        // ...
        // and normal stuff with the user object goes here
        // ...
    });
}

In this case, I just expect to get a user back, and not get any errors (like the database is disconnected, or something), so I just put the 500 handler in the case that an err object is passed, and go back to my happy-path logic. If 500s start to show up there at runtime, I’ll be able to decide if they should be converted to a 4xx error, or fixed.

For example, if I start seeing errors where err.message is “User Not Found”, as if the client requested a non-existent user, I might add 404 handling like so:

function(req, res){
    db.user.findById(someId, function(err, user){
        if (err) {
            if (err.message === "User Not Found"){
                res.statusCode = 404;
                return res.end("User not found!");
            }
            return internalServerError(err, req, res, "Internal Server Error");
        }
        // ...
        // and normal stuff with the user object goes here
        // ...
    });
}

Conversely, if I start seeing errors where err.message is “Database connection lost”, which is a valid 500 scenario, I might not add any specific handling for that scenario. Instead, I’d start looking into solving how the database connection is getting lost.

If you’re building a JSON API, I’ve got a great middleware for unified error-handling (and status codes in general) called json-status.

A unified error handler like this also leaves you with the ability to expand all internal server error handling later, when you get additional ideas. For example, we’ve also added the ability for it to log the requesting user’s information, if the user is logged in.

Logged Errors

I often make liberal use of console.error() to log errors and their .stack properties when debugging restarts and 500s, or just to report errors that shouldn’t necessarily affect the http response code (like errors in fire-and-forget function calls where we don’t care about the result enough to call the request a failure).

Detection: I ended up adding a method to console called console.flog() (You can name the method whatever you want, instead of ‘flog’ of course! I’m just weird that way.) that acts just like console.error() (ultimately by calling it with the same arguments), but also increments a “logged error” metric like so:

console.flog = function(){
  if (metrics.increment){
    metrics.increment("loggedErrors");
    // assume `metrics` is required earlier
  }

  var args = Array.prototype.slice.call(arguments);
  if (process.env.NODE_ENV !== "testing"){
    args.unshift("LOGGED ERROR:");
    // put a prefix on error logs to make them easier to search for
  }

  console.error.apply(console, args);
};

With this in place, you can convert all your console.error()s to console.flog()s and your metrics will be able to show you when logged errors are increasing.

It’s nice to have it on console because it sort of makes sense there and console is available everywhere without being specifically require()ed. I’m normally against this kind of flagrant monkey-patching, but it’s really just too convenient in this case.

Log Levels

I should note too that I don’t use traditional log levels (error/debug/info/trace) anymore, because I don’t find them all that useful. Something is either logged as an error or it isn’t, and I generally just strive to keep the logs free of everything other than lines that indicate the requests being made, like so:

  ...
  POST /api/session 201
  GET /api/status 200
  POST /api/passwordReset 204
  GET /api/admin.php 404
  ...

That doesn’t mean that I don’t sometimes need debug output, but it’s so hard to know what debug output I need in advance that I just add it as needed and redeploy. I’ve just never found other log-levels to be useful.

More About Metrics

All of the efforts above have been to make defects self-reporting and visible. Like I said in the first part of this series, the greatest supplement to logging that I’ve found is the time-series metrics graph. Here’s how it actually looks (on a live production system) in practice for the different types of defects that have been discussed here.

The way to make metrics most effective is to keep them on a large video display in the room where everyone is working. It might seem like this shouldn’t make a difference if everyone can just go to their own web browsers to see the dashboard whenever they want, but it absolutely makes a huge difference in practice. Removing that minor impediment and having metrics always available at a glance results in people checking them far more often.

Without metrics like this, and a reasonable aggregated logging solution, you are really flying blind, and won’t have any idea what’s going on in your system from one deployment to the next. I personally can’t imagine going back to my old ways that didn’t have these types of instrumentation.