(Thanks to Chris Frank and Sam Decesare for the feedback!)

Bugs and outages happen. If the team can learn from them, though, it's possible to reclaim some of the effectiveness they cost, sometimes even to the point that the learning is worth more than the effectiveness lost.

The best mechanism I've seen to drive that learning is for the team to meet once the outage or defect is no longer an immediate problem and try to learn from what happened. This meeting is often called the Blameless Post-mortem. These meetings have been the biggest agents of improvement I've ever seen on an engineering team, so if you think you'll save time by skipping them, you're probably making a huge mistake.

Institute Blamelessness

It’s crucial to assume that everyone involved in the issue would’ve avoided it if possible, given the chance again. It’s natural for people to want to point the finger (sometimes even at themselves) but if the team allows this, it’ll quickly lead to a culture of cover-ups where nothing can be learned because real information about what happened can’t be extracted.

What's worse is that once you've decided who's responsible, you have no reason to continue investigating what there is to learn. It seems obvious that since that person is to blame, the issue won't recur if they're either removed or coerced into trying harder next time.

This reminds me of reading about airplane crashes in the news when the airline concludes the cause was "human error". That was really the best that airline could do? They decided either to ask that human to be better next time, or to replace that human with another human. There's really nothing to learn? No way to improve the aircraft? No way to improve the process? No way to improve the environment? No way to improve the human's understanding of these? This is an airline I'd be afraid to fly with.

Sidney Dekker’s The Field Guide to Understanding Human Error, an absolutely genius book on the subject, sums it up nicely: “Human error is not the conclusion of an investigation. It is the starting point.”

Look for all the Causes

Often these meetings are called RCAs or Root-Cause Analysis meetings, but I try not to call them that anymore, because there rarely seems to be a single root cause. Computer systems are complex systems, and usually multiple things have gone wrong for a defect to appear. John Allspaw explains it more succinctly:

Generally speaking, linear chain-of-events approaches are akin to viewing the past as a line-up of dominoes, and reality with complex systems simply don’t work like that. Looking at an accident this way ignores surrounding circumstances in favor of a cherry-picked list of events, it validates hindsight and outcome bias, and focuses too much on components and not enough on the interconnectedness of components.

Also make sure you've validated your assumptions about these causes. There's a powerful tendency to fall prey to the What-You-Look-For-Is-What-You-Find principle for the sake of expediency and simplicity.

Dig deeper

It's usually a good idea to dig deeper than you'd expect when determining how a problem occurred. One practice that forces a deeper investigation is The 5 Whys. Basically, you ask "why?" of each successive answer until you've gone at least five levels deep, finding a precursor for the problem and then for each of its precursors. Often, in the deeper parts of this conversation, you end up investigating bigger-picture problems, like the company's values, external pressures, and long-standing misconceptions. These deeper problems often require internal or external social conflicts to be resolved, which makes them tough, but also high-value.

A few caveats though:

  • It can quickly become “The 5 Whos”. You still want to remain blameless.
  • It assumes that there’s a single chain of cause and effect leading to the defect (and there rarely is).

This second caveat is the reason I don't really care for The 5 Whys practice anymore. John Allspaw has some great further discussion of that problem here as well.

Decide on Next Actions as a Team

One common yet major mistake I've seen is a manager hearing the description of the defect/outage and saying "well, we'll just do x then" without soliciting the input of the team. This usually backfires because the manager rarely has the intimate knowledge of the system that the developers do, and even if he/she did, it's extremely unlikely that any single person can consistently decide on a better course of action than a team can. Also, the more a manager does this, the less likely the team is to feel that it's their place to try to solve these problems.

With that said, your team should still be conscious that some of the causal factors are not technical at all. Some examples that I’ve personally seen are:

  • The team is over-worked.
  • The team has been under too much pressure to move quickly or to meet unrealistic deadlines.
  • The team wants more training in certain areas.
  • Management is constantly changing priorities and forcing work to be left partially finished.
  • The team’s workflow is too complicated and merges keep leading to undetected defects.
  • Management won’t allow time to be spent on quality measures.

It's a common mistake for developers to focus only on the technical problems, probably because those are the ones most easily controlled by the development team. For a team to be truly effective, though, it must be able to address the non-technical factors as well, which often means managing up. Great management will pay very close attention to the team's conclusions.

Resist “Try Harder Next Time”

Hand-in-hand with blamelessness should almost always be a rule that no improvement may consist of "trying harder next time". That assumes someone didn't try hard enough last time, and that effort is the only thing that needs to change for the team to be more effective. People will either naturally want to try harder next time, or they won't. Saying "try harder next time" usually won't change a thing.

In fact, you'd usually be more successful not just by avoiding solutions that require more human discipline, but by taking it one step further and reducing the level of discipline already required. There's a great blog post on this by Marco Arment here, and I can tell you, the results in real life are often amazing.

Humans are great at conserving their energy for important things, or things that are likely to cause issues, but the result of that is that unlikely events are often not given much effort at all. This is a common trade-off in all of nature that Erik Hollnagel calls the ETTO (Efficiency Thoroughness Trade-Off) principle. You don’t want your solutions to be fighting an uphill battle against nature.

There’s another kind of strange result of this (I think, anyway) called “the bystander effect”, where often if a problem is the responsibility of multiple people, it’s less likely that any single person will take responsibility for it. This is a real phenomenon and if you’ve worked on a team for any length of time, you’ve seen it happen. You’ll want to try to make sure that whatever solutions you come up with, they won’t fall victim to this bystander effect.

Consider Cost-to-fix vs. Cost-to-endure

It should go without saying that the cost of your solutions for issues should be less than the cost of the issue itself. Sometimes these costs are really hard to estimate given the number of variables and unknowns involved, but it’s at least worth consideration. An early-stage start-up is unlikely to care about being multi-data-center for the sake of redundancy, for instance. It would be ridiculous, on the other hand, for a bank to not seek this level of redundancy.

Consider Process Costs

The second most naive reaction to a bug or outage (after "try harder next time") is usually to add more human process to the team's existing process: more checks, more oversight. While you may well conclude that this is the best measure for a given issue, keep in mind that it's probably also the slowest, most expensive, and most error-prone. These solutions are often the simplest to conceive, but if they're your team's only reaction to issues, they will build up over time, dooming the team to move slowly as it works its way repeatedly through all of these processes.

Improve Defect Prevention

There are far too many possible ways of trying to prevent bugs from reaching production to get into here, but there are two really important criteria for evaluating them:

(1) Does the method find bugs quickly and early in development?

The shortness of your feedback loop in detecting bugs is hugely important in making sure that your prevention methods don't slow the team down so much that their cost outweighs their benefit. Manual human testing by a separate testing team is probably the slowest and latest way to find bugs, whereas syntax highlighters may be both the fastest and earliest for the class of issue that they can uncover (of course these methods test completely different things; they're mentioned to give an idea of both extremes of the feedback loop).

(2) Does the method give you enough information to fix the problem quickly/easily?

This criterion is important, though admittedly less so than the previous one. You'll still want to judge your prevention measures on it, because it's another factor that can cost you a lot of time and efficiency. Incidentally, manual human testing is probably the worst by this criterion as well, because testing at that level generally just lets you know that something is broken in a particular area. Unit testing beats integration testing here too, because unit testing does a better job of helping you pinpoint the issue down to a particular unit (though they don't actually test for the same types of bugs at all, so it's a bit of an unfair comparison).

With these two criteria in mind, it's useful to look at a number of defect prevention measures critically: TDD, unit testing, integration testing, manual testing, beta testing, fuzz-testing, mutation testing, staging environments, dog-fooding, automated screen-shot diffing, static analysis, static typing, linting, pair programming, code reviews, formal proofs, 3rd-party auditing, checklists, etc. I've tried to be a bit exhaustive in that list, and while I've added some options that have never been useful to me, I've probably also forgotten a few. There are new preventative measures popping up all the time too.

An amazing example of preventative measures is probably the extremely popular SQLite project. Their discussion of the measures that they take is fascinating.

Remove Human Processes with Automation

So I've hinted at this a few times already, but it should be reiterated that automating otherwise manual human processes can bring a level of speed and consistency to them that humans can't compete with. Sometimes this is tricky enough that it's not worth it, but those scenarios are getting fewer as technology progresses. There are two huge risks in automation though:

(1) Automation also involves software that can fail. Now you have two systems to maintain and try to keep defect-free.

Often this second system (the automation of the primary system) doesn’t get engineered with the same rigor as the primary system, so it’s possible to automate in a way that is less consistent and more error-prone than a human.

(2) If you’re really going to replace a manual human task, make sure the automation really does outperform the human.

I’ve seen many attempts at automation not actually meet the goal of doing the job as well as a human. It’s not uncommon at all to see examples like teams with 90+% automated test coverage releasing drop-everything defects as awesome as “the customer can’t log in” because some CSS issue makes the login form hidden. A team that sees bugs like this often is almost certainly not ready to remove humans from the testing process, regardless of how many tests they’ve written.

Eliminate Classes of Bugs

When you think about preventing defects without succumbing to "try harder next time" thought-patterns, one of the most powerful tools is to consider how you could make that defect impossible in the future. Often it's possible to avoid defects by working at better levels of abstraction. Here are a few examples:

  • Avoid off-by-one errors in for loops by using iterator functions, or functional paradigms that give you filter(), map() and reduce().

  • Avoid SQL injection by using prepared statements.

  • Avoid Cross-site scripting attacks by using an HTML template language that automatically escapes output.

  • Avoid bugs from unexpected changes in shared objects by using immutable data structures.

You may need some creativity here for your own particular defects, but in many cases eliminating the opportunity for a bug to arise is better than trying to catch it when it happens.
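For instance, here's a minimal sketch of the prepared-statement idea from the list above, assuming a node-postgres-style client (the client, table, and helper names are placeholders, not from the original post):

// Assumes a node-postgres-style client; any driver with parameterized queries works similarly.
function findUserByEmail(client, email, cb){
  // The user input stays data and never becomes part of the SQL string,
  // which eliminates the injection opportunity entirely.
  client.query('SELECT * FROM users WHERE email = $1', [email], function(err, result){
    if (err) return cb(err);
    return cb(null, result.rows[0]);
  });
}

// The injectable version this replaces looks something like:
//   client.query("SELECT * FROM users WHERE email = '" + email + "'", cb);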

For example, I once worked on a team that would occasionally forget to remove console.log() calls from our client-side javascript, which would break the entire site in IE 8. By putting a check for console.log() calls in the build (and breaking it when they exist), we eliminated this class of defect entirely.
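The actual check we used isn't shown here, but a rough sketch of that kind of build step might look like this (the paths and file layout are hypothetical):

// check-console-log.js -- run as part of the build; a non-zero exit breaks the build.
var fs = require('fs');
var path = require('path');

var offenders = [];

function scan(dir){
  fs.readdirSync(dir).forEach(function(name){
    var file = path.join(dir, name);
    if (fs.statSync(file).isDirectory()) return scan(file);
    if (/\.js$/.test(file) && /console\.log\(/.test(fs.readFileSync(file, 'utf8'))){
      offenders.push(file);
    }
  });
}

scan('./public/js'); // hypothetical location of the client-side javascript

if (offenders.length > 0){
  console.error('console.log() calls found in:', offenders.join(', '));
  process.exit(1);
}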

Go Beyond Prevention

Defect and outage prevention is only one side of the quality coin, though it's usually the side people naturally reach for when figuring out how to handle defects better in the future. You should of course investigate better prevention measures, but you should also consider solutions that will improve your situation when defects do occur, because failures will always happen.

I personally think it's entirely unrealistic for all your measures to be preventative. Focusing entirely on prevention tends to slow down your team's ability to deliver while still falling short of the level of quality you could otherwise achieve.

With that said, here are a few classes of mitigating measures:

Improve your “Time to Repair”

There's an interesting metric called MTTR, which stands for "Mean Time to Repair/Recovery" and is basically the average time it takes you to fix a defect/outage. It's an important metric, because the cost of a defect must include how long that defect was affecting customers. The speed at which you can deliver a fix is going to be a major factor in defect mitigation. You'll want to ask questions like:

  • How can we pinpoint problems faster?
  • How can we create fixes faster?
  • How can we verify our fixes faster?
  • How can we deliver fixes faster?

Practices like Continuous Delivery can help greatly here. If you have a 20-minute manual deployment that involves a number of coordinated activities from a number of team members, you will be leaving customers exposed for much longer than a team practicing Continuous Delivery.

Automated testing on its own can be a huge help. If the bulk of your tests is manual, then a fix will take some time to verify (including verifying that it doesn't break anything else). Teams that rely heavily on manual testing will usually test much less thoroughly on a "hot fix", which can occasionally make the situation worse.

In my experience though, nothing affects MTTR as much as the speed at which you can detect defects/outages…

Improve Detection Time

Talk about how quickly you discovered the issue compared to when it was likely to have started. If your customers are discovering your issues, try figuring out if there’s a way that you can beat them to it. Instrumentation (metrics & logging) has been a huge help for me in different organizations for knowing about problems before customers can report them. Information radiators can help keep those metrics ever-present and always on the minds of the team members.

Threshold-based alerting systems (that proactively reach out to the team to tell them about issues) in particular are valuable because they don’t rely on the team to check metrics themselves, and they can short circuit that “human polling loop” and alert the team much faster, or during times that they ordinarily would not be looking (at night, on weekends, etc). It’s pretty easy to see that an alerting system that alerts about an outage on a Friday night can save the customer days of exposure.
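As a purely illustrative sketch (the metric source and paging function here are hypothetical, not a specific tool's API), a threshold alert is conceptually just a periodic check that reaches out to a human instead of waiting for one to look at a graph:

// Check the error rate every minute and page the on-call engineer when it
// crosses a threshold. fetchErrorRate() and pageOnCall() are placeholders.
var THRESHOLD = 50; // errors per minute -- tune this to your own system

setInterval(function(){
  fetchErrorRate(function(err, errorsPerMinute){
    if (err) return console.error('could not fetch the error-rate metric', err);
    if (errorsPerMinute > THRESHOLD){
      pageOnCall('error rate is ' + errorsPerMinute + '/min, above ' + THRESHOLD);
    }
  });
}, 60 * 1000);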

Lessen the Impact of Failures

If you can figure out ways for failures to have less impact, that’s a huge win as well. Here are a few examples of ideas I’ve seen come out of one of these meetings:

  • Have deployments go to only a 5% segment of users first for monitoring before going out to 100% of the users.
  • Speed up builds and deployments, so hot-fixes can go out faster.
  • Have an easy way to display a message to users during an outage.
  • Improve metrics and logging to speed-up debugging around that particular issue.
  • Set-up off-hours alerting.
  • Have a one-click “revert deployment” mechanism that can instantly revert to a previous deployment in case something goes wrong.
  • Create “bulkheads/partitions” in your applications so that if one part fails, the rest can still function properly. There are many common examples of this in software, including PHP’s request partitioning model, or the browser’s ability to continue despite a javascript exception, even on the same page. Service-oriented architectures often have this quality as well.

You may need some creativity here to come up with your own, but it's worth the effort.

Be Realistic with Plans For Improvement

Whatever you say you will do as a result of this meeting, make sure that it's actually realistic and that there's a realistic plan to get it into the team's future work (e.g. who will do it? when?). The best way to have completely useless meetings is to not actually do what you plan to do.

Write Up a Report for the Rest of the Company

The report should say honestly how bad the problem was, in what way (and for how long) customers were affected, and generally what events led up to it (blamelessly!). Additionally, you'll want to declare the next steps the team plans to take, so that the company knows you're a professional engineering team that cares about results as much as everyone else in the organization. You should be ready and willing to take questions and comments on this report, and send it to as many interested parties as possible. Often other people in the company will have additional ideas or information, and the transparency makes them feel like these are welcome any time.

The real value in this is that you show that the entire team is uniformly holding itself accountable for the problem, and that any propensity the rest of the organization has for blaming a single person is not in accordance with the engineering team's views. The engineering team and management should be willing and able to defend any individuals targeted for blame.

Decide What Types of Defects/Outages Necessitate These Meetings

Some organizations are more meeting-tolerant than others, so there's no hard-and-fast rule here. If you had one of these meetings for every production defect, though, you'd probably very quickly have a bunch of solutions in place that greatly reduce the number of defects (and therefore the number of these meetings!). These meetings are all investments. The more you have, the more they start to pay off, both in product quality and in speed of delivery (if you stay conscious of that!).

One thing I will recommend though is that you look for recurrences and patterns in these defects/outages. The team will usually benefit disproportionately from more time invested in solving repeated problems.

Last week I had the pleasure of being on a panel about Continuous Testing put on by Electric Cloud. There's video of the discussion here, if you're interested.

Additionally, I posted a blog post on the subject on the ClassDojo Engineering blog here.

For posterity, I’ll x-post here as well:

Continuous Testing at ClassDojo

We have thousands of tests and regularly deploy to production multiple times per day. This article is about all the crazy things we do to make that possible.

How we test our API

Our web API runs on node.js, but a lot of what we do should be applicable to other platforms.

On our API alone, we have ~2000 tests.

We have some wacky automated testing strategies:

  • We actually integration test against a database.

  • We actually do api tests via http and bring up a new server for each test and tear it down after.

  • We mock extremely minimally, because there’s little to gain performance-wise in our case, and we want to ensure that integrations work.

  • When we do more unit-type tests, it's for the convenience of testing a complicated detail of some component, not for performance.

All ~2000 tests run in under 3 minutes.

It’s also common for us to run the entire suite dozens of times per day.

With that said, I usually do TDD when a defect is found, because I want to make sure that the missing test actually exposes the bug before I kill the bug. It’s much easier to do TDD at that point when the structure of the code is pretty unlikely to need significant changes.

We aim for ~100% coverage: We find holes in test coverage with the istanbul code coverage tool, and we try to close those holes. We’ve got higher than 90% coverage across code written in the last 1.5 years. Ideally we go for 100% coverage, but practically we fall a bit short of that. There are some very edge-case error-scenarios that we don’t bother testing because testing them is extremely time-consuming, and they’re very unlikely to occur. This is a trade-off that we talked about as a team and decided to accept.

We work at higher levels of abstraction: We keep our tests as DRY as possible by extracting a bunch of test helpers that we write along the way, including our api testing library verity.

Speeding up Builds

It’s critical that builds are really fast, because they’re the bottleneck in the feedback cycle of our development and we frequently run a dozen builds per day. Faster builds mean faster development.

We use a linter: We use jshint before testing, because node.js is a dynamic language. This finds common basic errors quickly before we bother with running the test suite.

We cache package repository contents locally: Loading packages for a build from a remote package repository was identified as a bottleneck. You could set up your own npm repository, or just use a cache like npm_lazy. We just use npm_lazy because it’s simple and works. We recently created weezer to cache compiled packages better as well for an ultra-fast npm install when dependencies have not changed.

We measured the real bottlenecks in our tests:

  1. Loading fixtures into the database is our primary bottleneck, so we load fixtures only once for all read-only tests of a given class (there's a sketch of this after the list). It's probably not a huge surprise that this is our bottleneck, given that we don't mock out database interaction, but we consider it acceptable given that the tests still run extremely fast.

  2. We used to use one of Amazon EC2’s C3 compute cluster instances for jenkins, but switched to a much beefier dedicated hosting server when my laptop could inexplicably run builds consistently faster than EC2 by 5 minutes.
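Here's roughly what that load-once pattern looks like with mocha-style before()/it() blocks; loadFixtures() and request() are hypothetical helpers standing in for our real ones:

var assert = require('assert');

describe('GET /api/users (read-only)', function(){

  // Load fixtures a single time for this whole block of read-only tests,
  // instead of reloading the database before every single test.
  before(function(done){
    loadFixtures('users', done);
  });

  it('lists users', function(done){
    request('GET', '/api/users', function(err, res){
      if (err) return done(err);
      assert.equal(res.statusCode, 200);
      done();
    });
  });

  it('returns a single user', function(done){
    request('GET', '/api/users/1', function(err, res){
      if (err) return done(err);
      assert.equal(res.statusCode, 200);
      done();
    });
  });
});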

We fail fast: We run full integration tests first so that the build fails faster. Developers can then run the entire suite locally if they want to.

We notify fast: Faster notification of failures/defects leads to diminished impact. We want developers to be notified ASAP of a failure so:

  • We bail on the build at the first error, rather than running the entire build and reporting all errors.
  • We have large video displays showing the test failure.
  • We have a gong sound effect when it happens.

We don’t parallelize tests: We experimented with test suite parallelization, but it’s really complicated on local machines to stop contention on shared resources (the database, network ports) because our tests use the database and network. After getting that to work, we found very little performance improvement, so we just reverted it for simplicity.

Testing Continues after Deployment

We run post-deployment tests: We run basic smoke tests post-deployment as well to verify that the deployment went as expected. Some are manually executed by humans via a service called Rainforest. Any of those can and should be automated in the future though.

We use logging/metrics/alerts: We consider logging and metrics to be part of on-going verification of the production system, so it should also be considered as part of our “test” infrastructure.

Our Testing Process

We do not have a testing department or role. I personally feel that that role is detrimental to developer ownership of testing, and makes continuous deployment extremely difficult/impossible because of how long manual testing takes. I also personally feel that QA personnel are much better utilized elsewhere (doing actual Quality Assurance, and not testing).

Continuous Testing requires Continuous Integration. Continuous Integration doesn’t work with Feature Branches. Now we all work on master (more on that here!).

We no longer use our staging environment. With the way that we test via http, integrated with a database, there isn't really anything else that a staging server is useful for, testing-wise. The api is deployed straight to a small percentage (~5%) of production users once it is built. We can continue testing there, or just watch logs/metrics and decide whether or not to deploy to the remaining ~95% or to revert, both with just the click of a button.

We deploy our frontends against the production api in non-user facing environments, so they can be manually tested against the most currently deployed api.

When we find a defect, we hold blameless post-mortems. We look for ways to eliminate that entire class of defect without taking measures like “more manual testing”, “more process”, or “trying harder”. Ideally solutions are automatable, including more automated tests. When trying to write a test for a defect we try as hard as possible to write the test first, so that we can be sure that the test correctly exposes the defect. In that way, we test the test.

Lessons learned:

  • Our testing practices might not be the best practices for another team. Teams should decide together what their own practices should be, and continually refine them based on real-world results.

  • Find your actual bottleneck and concentrate on that instead of just doing what everyone else does.

  • Hardware solutions are probably cheaper than software solutions.

  • Humans are inconsistent and less comprehensive than automated tests, but more importantly they’re too slow to make CD possible.

Remaining gaps:

  • We’re not that great at client/ui testing (mobile and web), mostly due to lack of effort in that area, but we also don’t have a non-human method of validating styling etc. We’ll be investigating ways to automate screenshots and to look for diffs, so a human can validate styling changes at a glance. We need to continue to press for frontend coverage either way.
  • We don’t test our operational/deployment code well enough. There’s no reason that it can’t be tested as well.
  • We have a really hard time with thresholds, anomaly detection, and alerting on production metrics, so we basically have to watch a metrics display all day to see if things are going wrong.

The vast majority of our defects/outages come from these three gaps.

It’s been over a year since I’ve been on a team that does Sprints as part of its engineering process, and I’ve really changed my mind on their usefulness. I’m probably pretty old-school “agile” by most standards, so I still think Scrum is better than what most teams out there are doing, but I think it’s lost me as a proponent for a few reasons, with the most surprising reason (to me) being its use of sprints.

The sprint, for those few engineers that might not have ever developed this way, is simply a time box where you plan your work at the beginning, usually estimating what work will fit within that time box, and then do some sort of review afterward to assess its efficacy, including the work completed and the accuracy of the estimates.

I’ve been involved in sprints as short as a week and as long as a month, depending on the organization. The short timespan can have an amazingly transformative effect on organizations that are releasing software less often than that: it gets them meeting regularly to re-plan, it gets them improving their process regularly, and it gets a weekly stream of software released so that the rest of the company can get a sense of what is happening and at what pace. These are huge wins: You get regular course-correction, improvement and transparency, all helping to deliver more working software, more reliably. What’s not to love?

Well, I've always ignored a couple of major complaints about Sprints because I've felt that those complaints generally misunderstood the concept of the time box. The first complaint is that a Sprint is just another deadline, and one that looms more frequently, so it just adds more pressure. If your scrummaster or manager treats them like weekly deadlines, he/she is just plain "doing it wrong". Time boxes are just meant to be a regular re-visiting and re-assessment of what's currently going on. It's supposed to be a realistic, sustainable pace so that, when observed, the speed the team delivers at can start to be relied upon for future projections. Treating it like a deadline has the opposite effect:

  • Developers work at unsustainable paces, inherently meaning that at some point in the future, they will fail to maintain that pace and generally will not be as predictable to the business as they otherwise could be.
  • Developers cut corners, leading to defects, which lead to future interruptions, which leads to the team slowing more and more over time. I’ve been on a team that pretty much only had time for its own defects, so I’ve seen this taken to the absolute extreme.

Here's the thing though: if it's supposed to encourage a reliable, sustainable pace, why would you ever call it a "sprint"? "What's in a name?", right? Well, it turns out that when you want to get people to change the way they work, and you want them to understand the completely foreign concepts you're bringing to them, it's absolutely crucial that you name the thing in a way that also explains what it is not.

In Scrum, it's also common to have a "sprint commitment", where the team "commits" to a body of work to accomplish in that time frame. The commitment is meant to be a rough estimate for planning purposes, and if a team doesn't get that work done in that time, it tries to learn from the estimate and be more realistic in the next sprint. Developers are not supposed to be chastised for not meeting the sprint commitment — it's just an extra piece of information to improve upon and to use for future planning. Obviously naming is hugely important here too, because in every other use of the word, a "commitment" is a pledge or a binding agreement, and this misnomer really influences the way people (mis)understand the concept of sprints. Let's face it: if people see sprints as just more frequent deadlines (including those implementing them), the fault can't be entirely theirs.

Sooo… The naming is terrible, but the concept is a good one, right?

Well “iteration” is definitely a much better name, and I hope people use that name more and more.

More importantly though, I’d like to argue that time boxes for planning and delivery are fundamentally flawed.

I’ve personally found that sprint commitments are entirely counter-productive: A team can just gauge its speed for planning purposes based on past performance instead, which is really what the sprint commitment is meant to be. We should be forecasting entirely based on past performance, and adjusting constantly rather than trying to predict, and trying to hold people to predictions.

Also, with planning happening at the beginning of the sprint and software delivery happening at the end, in a regular cadence, the team is much less likely to do things more frequently than this schedule.

Instead of planning and releasing on the schedule of a sprint, we’ve had a lot more success practicing just-in-time planning and continuous delivery.

Just-In-Time Planning

Planning should happen on-demand, not on a schedule. Planning on a schedule often has you planning for things too far ahead, forcing people to try to remember the plan later. I’ve seen an hour’s worth of planning fall apart as soon as we tried to execute the plan, so I really don’t see the point of planning for particular time-spans (especially larger ones). Teams should simply plan for the work they’re currently starting, the natural way, not based on time, but based on the work they have to do. They should plan and re-plan a particular task as frequently as is necessary to deliver it, and they should plan for as long or as short as necessary to come up with a plan that everyone is happy with. Planning naturally like this leads to more frequent situations where developers will say “hold on, we should sit down and talk about this before we continue the current plan”, which is good for everybody involved.

Continuous Delivery

The cadence of delivering software at the end of a sprint almost always means an organization does not deploy the software more often than that cadence. I’ve worked on teams that tried to make “deployment day” somewhere in the middle of the sprint to emphasize that the sprint boundary is just an arbitrary time box, decoupled from deployment, but we never managed to deploy more often than once per sprint, and it really only resulted in a de facto change in the sprint schedule for our QA personnel. The very existence of a time-box puts people in the mindset that releasing software, deploying software and calling software “done” are all the same event and there’s a scheduled time for that. Without the time-box, people start to think more freely about ways to decouple these events and how to safely improve the customer experience more often. Today, free from sprints, I’m safely deploying to production multiple times per day.

Now Scrum proponents will argue that Sprints can be exactly like this — sprints do not preclude JIT planning or continuous delivery — and the time-box just adds additional points of planning and review to ensure that the team is doing the necessary planning and reviewing. While I wholeheartedly agree with this on a theoretical level, this is not what ends up happening in practice: In reality, when you schedule time for particular aspects of work, people tend to wait until that schedule to do that type of work.

And I realize that what I'm suggesting sounds a bit more sloppy, and lacks the formality of a scheduled cadence, but it simply ends up being more natural and more efficient. These concepts aren't new either — any teams that have been using the venerable kanban method will be familiar with them.

One Additional Caveat

With all this said, if a team is going to continually improve, I haven’t seen a better mechanism than the regular retrospective. I just don’t think Sprints are necessary to make that happen.

This is the last part of my 3-part series on production-quality node.js web applications. I've decided to leave error prevention to the end because, while it's super-important, it's often the only strategy developers employ to ensure customers have as defect-free an experience as possible. It's definitely worth checking out parts I and II if you're curious about other strategies and really interested in getting significant results.

Error Prevention

With that said, let's talk about a few error-prevention tricks and tactics that I've found extremely valuable. There are a bunch of code conventions and practices that can help immensely, and here's where I'll get into those. I should add that I'm not going to talk about node-fibers, harmony generators, or streamline.js, which each offer different solutions, ultimately because I haven't used them (none of them are considered very 'standard' yet). There are people using them to solve a few of the issues I'll talk about, though.

High test coverage, extensive testing

There should be no surprise here, but I think test coverage is absolutely essential in any dynamically typed app, because so many errors can only happen at runtime. If you haven't got heavy test coverage (even measuring coverage itself is valuable!), you will run into all kinds of problems as the codebase gets larger and more complex, and you'll be too scared to make sweeping changes (like changing out your database abstraction, or how sessions are handled). An untested/uncovered codebase will have you constantly shipping codepaths to production that have never actually been executed anywhere.

The simplest way I’ve found to get code coverage stats (and I’ve used multiple methods in the past) is to use istanbul. Here’s istanbul’s output on a file from one of my npm libraries:

[Screenshot: istanbul's coverage output for a file from one of my npm libraries, with untested lines marked in red]

The places marked in red are the places that my tests are not testing at all. I use this tool all the time to see what I might have missed testing-wise, when I haven’t been doing TDD, and then I try to close up the gaps.

Use a consistent error passing interface and strategy

Asynchronous code should always take a callback that expects an Error object (or null) in the first parameter (the standard node.js convention). This means:

  • Never create an asynchronous function that doesn’t take a callback, no matter how little you care about the result.

  • Never create an asynchronous function that doesn’t take an error parameter as the first parameter in its callback, even if you don’t have a reason for it immediately.

  • Always use an Error object for errors, and not a string, because Error objects have very handy stack traces.

Following these rules will make things much easier to debug later when things go wrong in production.

Of course there are some cases where you can get away with breaking these rules, but even in those cases, over time, as you use these functions more often and change their internal workings, you’ll often be glad that you followed these rules.
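To make the convention concrete, here's a minimal example that follows all three rules (reading and parsing a config file is just a stand-in for any asynchronous work):

var fs = require('fs');

// Error-first callback: cb(err, result). Always pass an Error object (or null)
// first, even if no caller cares about failures today.
function readConfig(filePath, cb){
  fs.readFile(filePath, 'utf8', function(err, contents){
    if (err) return cb(err);                 // pass the Error object up, never a string
    try {
      return cb(null, JSON.parse(contents)); // success: null error, then the result
    } catch (parseErr) {
      return cb(parseErr);                   // parse failures are errors too
    }
  });
}

readConfig('./config.json', function(err, config){
  if (err) return console.error('could not load config', err, err.stack);
  console.log('loaded config', config);
});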

Use a consistent error-checking and handling strategy.

Always check the errors from a callback. If you really don’t care about an error (which should probably be rare), do something like this to make it obvious:

getUser(userId, function(err, user){
  if (err){
    // do nothing.  we didn't need a user that badly
  }
});

Even then, it’s probably best to log the error at the very least. It’s extremely rare that you’d bother writing code of which you don’t care about the success/failure.

In many cases you won’t be expecting an error, but you’ll want to know about it. In those cases, you can do something like this to at least get it into your logs:

getUser(userId, function(err, user){
  if (err){
    console.error('unexpected error', err, err.stack);
    // TODO actually handle the error though!
  }
});

You’re going to want to actually handle the unexpected error there too of course.

If you’re in some nested callback, you can just pass it up to a higher callback for that callback to handle:

getUser(userId, function(err, user){
  if (err){
    return cb(err);
  }
});

NB: I almost always use return when calling callbacks to end execution of the current function right there and to save myself from having to manage a bunch of if..else scenarios. Early returns are just so much simpler in general that I rarely find myself using else anywhere anymore. This also saves you from becoming another “can’t set headers after they are sent” casualty.

If you’re at the top of a web request and you’re dealing with an error that you didn’t expect, you should send a 500 status code, because that’s the appropriate code for errors that you didn’t write appropriate handling for.

getUser(userId, function(err, user){
  if (err){
    console.error('unexpected error', err, err.stack);
    res.writeHead(500);
    res.end("Internal Server Error");
  }
});

In general, you should always be handling or passing up every single error that could possibly occur. This will make surfacing a particular error much easier. Heavy production use has an uncanny ability to find itself in pretty much any codepath you can write.

Watch out for error events.

Any EventEmitters in your code (including streams) that emit an error event absolutely need to have a listener for those error events. Uncaught error events will bring down your entire process, and usually end up being the main cause of process crashes in projects that I’ve worked on.

The next question will be how you handle those errors, and you’ll have to decide on a case-by-case basis. Usually you’ll at least want to 500 on those as well, and probably log the issue.
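A minimal illustration with a read stream (the file path is only for demonstration):

var fs = require('fs');

var stream = fs.createReadStream('/some/file/that/might/not/exist');

// Without this listener, a missing file (or any other read failure) emits an
// unhandled 'error' event and takes down the whole process.
stream.on('error', function(err){
  console.error('read stream failed', err, err.stack);
  // decide here whether to respond with a 500, retry, or just log it
});

stream.on('data', function(chunk){
  // normal processing goes here
});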

Managing Errors at Runtime

Domains

Domains are without a doubt your best tool for catching errors in runtime that you missed at development time. There are three places that I’ve used them to great effect:

  1. To wrap the main process. When it dies, a domain catches the error that caused it.
  2. To wrap cluster's child processes. If you use cluster-master like I do, or if you use cluster.setupMaster() with a different file specified via exec for child processes, you'll want the contents of that file to be wrapped in a domain as well. This means that when a child process has an uncaught error, this domain will catch it.
  3. To wrap the request and response objects of each http request. This makes it much more rare that any particular request will take down your process. I just use node-domain-middleware to do this for me (This seems like it should work in any connect-compatible server as well, despite the name).

In the case of catching an error with a domain on the main process or a child process, you should usually just log the issue (and bump a restart metric), and let the process restart (via a process manager for the main process, or via cluster for the child process — see part I for details about this). You probably do not want the process to carry on, because if you knew enough about this type of error to catch it, you should have caught it before it bubbled up to the process level. Since this is an unexpected error, it could have unexpected consequences on the state of your application. It's much cleaner and safer to just let the process restart.

If you've caught an error from a request or response object, it's probably safe and appropriate to send a 500 status code, log the error (and stack trace), and bump a 500 metric.
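For reference, request-level domain wrapping looks roughly like the sketch below. This is a simplified illustration of the idea, not the actual source of node-domain-middleware:

var domain = require('domain');

app.use(function(req, res, next){
  var d = domain.create();
  d.add(req);
  d.add(res);
  d.on('error', function(err){
    // any otherwise-uncaught error from this request lands here
    console.error('unexpected error', err, err.stack);
    metrics.increment("internalServerError");  // assumes a metrics client is in scope
    res.writeHead(500);
    res.end("Internal Server Error");
  });
  d.run(next);  // run the rest of the middleware chain inside the domain
});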

Also, I’ve written a module that helps me test that my domains are correctly in place: error-monkey. You can temporarily insert error-monkey objects in different places in your codebase to ensure that domains are catching them correctly. This way, you’re not tweaking the location of your domains and waiting for errors to happen in production to verify them.

Gracefully restarting

Now that domains have given you a place to catch otherwise uncaught exceptions, and you can hook in your own restart logic, you'll want to think about restarting gracefully rather than killing the process abruptly and dropping all the requests still in progress. Instead, you'll want to restart only after all current requests have completed (see http://nodejs.org/api/http.html#http_server_close_callback).

If you’re running a bunch of node processes (usually with cluster), and a number of servers (usually behind a load-balancer), it shouldn’t be that catastrophic to close one instance temporarily, as long as it’s done in a graceful manner.

Additionally, you may want to force a restart in the case that the graceful shutdown is taking an inordinate amount of time (probably due to hanging connections). The right amount of time probably has a lot to do with what types of requests you’re fulfilling, and you’ll want to think about what makes sense for your application.
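A rough sketch of that logic, with a hard deadline in case connections hang (the 30-second timeout is an arbitrary example):

function gracefulShutdown(server){
  // Stop accepting new connections; the callback fires once all
  // in-flight requests have completed.
  server.close(function(){
    console.log('all requests drained; exiting');
    process.exit(0);  // the process manager / cluster master will start a replacement
  });

  // Don't wait forever on hanging connections.
  setTimeout(function(){
    console.error('forcing exit: graceful shutdown took too long');
    process.exit(1);
  }, 30 * 1000).unref();  // unref() so this timer alone can't keep the process alive
}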

With that said…

These are just my notes on what I’ve tried so far that achieved worthy results. I’d be interested to hear if anyone else has tried any other approaches for preventing, mitigating, and detecting defects.

This is the second part of a three part series on making node.js web apps that are (what I would consider to be) production quality. The first part really just covered the basics, so if you’re missing some aspect covered in the basics, you’re likely going to have some issues with the approaches discussed here.

First, I’m going to talk about a few major classes of defects and how to detect them.

I could go straight to talking about preventing/avoiding them, but detection is really the first step. You’re simply not going to get a low-defect rate by just using prevention techniques without also using detection techniques. Try to avoid the junior engineer mindset of assuming that “being really careful” is enough to achieve a low defect rate. You can get a much lower defect rate and move much faster by putting systems in place where your running application is actively telling you about defects that it encounters. This really is a case where working smart pays much higher dividends than working hard.

The types of defects you’re most likely to encounter

I’m a huge fan of constantly measuring the quality aspects that are measurable. You’re not going to detect all possible bugs this way, but you will find a lot of them, and detection this way is a lot more comprehensive than manual testing and a lot faster than waiting for customers to complain.

There are (at least) 4 types of error scenarios that you’re going to want to eliminate:

  • restarts
  • time-outs
  • 500s
  • logged errors

Restarts

Assuming you've got a service manager to restart your application when it fails and a cluster manager to restart a child process when one fails (like I talked about in Part I), one of the worst types of error scenarios that you're likely to encounter is when a child process dies and has to restart. It's a fairly common mistake to believe that once you have automatic restarting working, process failure is no longer a problem. That's really only the beginning of a solution to the problem though. Here's why:

  • Node.js is built to solve the C10K problem, so you should expect to have a high number of requests per process in progress at any given time.

  • The node process itself handles your web-serving, and there's no built-in request isolation like you might find in other platforms like PHP, where one request can't have any direct effects on another request. With a node.js web application, any uncaught exception or unhandled error event will bring down the entire server, including your thousands of connections in progress.

  • Node.js’s asynchronous nature and multiple ways to express errors (exceptions, error events, and callback parameters) make it difficult to anticipate or catch errors more holistically as your codebase gets larger, more complex, and has more external or 3rd-party dependencies.

These three factors combine to make restarts both deadly and difficult to avoid unless you’re very careful.

Detection: The easiest way to detect these is to put metrics and logging at the earliest place in both your cluster child code and your cluster master code to tell you when the master or child processes start. If you want to remove the noise caused by new servers starting up, or normal deployments, then you may want to write something a little more complex that can specifically detect abnormal restarts (I’ve got a tiny utility for that called restart-o-meter too).
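As an illustration, the cluster master can report worker restarts with just a few lines like these (the metrics client is assumed to be in scope):

var cluster = require('cluster');

if (cluster.isMaster){
  console.log('cluster master started', process.pid);
  metrics.increment("masterStart");

  cluster.on('exit', function(worker, code, signal){
    console.error('worker died', worker.process.pid, code, signal);
    metrics.increment("workerRestart");  // a spike here means abnormal restarts
    cluster.fork();                      // replace the dead worker
  });
}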

You might have an aggregated logging solution that can send you alerts based on substrings it finds in the logs too, or that can feed directly into a time-series metrics system.

Time-outs

Time-outs are another type of failed request, where the server didn’t respond within some threshold that you define. This is pretty common if you forget to call res.end(), or a response just takes too long.

Detection: You can just write a quick and dirty middleware like this to detect them:

app.use(function(req, res, next){
  var timeoutThreshold = 10000; // 10 seconds
  res.timeoutCheck = setTimeout(function(){
    metrics.increment("slowRequest");
    console.error("slow request: ", req.method.toUpperCase(), req.url);
  }, timeoutThreshold);
  res.on('finish', function(evt){
    clearTimeout(res.timeoutCheck);
  });
  next();
});

(or you can grab this middleware I wrote that does basically the same thing).

500s

On a web application, one of the first things I want to ensure is that we're using proper status codes. This isn't just HTTP nerd talk (though no one will say I'm innocent of that), but rather a great system for easily categorizing the nature of the traffic going through your site. I've written about the common anti-patterns here, but the gist of it is:

Respond with 500-level errors if the problem was a bug in your app, and not in the client.

This lets you know that there was a problem, and it’s your fault. Most likely you only ever need to know 500 Internal Server Error to accomplish this.

Respond with 400-level errors if the problem was a bug in the client.

This lets you know that there was a problem and it was the client app’s fault. If you learn these 4 error codes, you’ll have 90% of the cases covered:

  • 400 Bad Request — When it’s a client error, and you don’t have a better code than this, use this. You can always give more detail in the response body.
  • 401 Unauthorized — When they need to be logged in, but aren't.
  • 403 Forbidden — When they're not allowed to do something.
  • 404 Not Found — When a url doesn’t point to an existing thing.

If you don’t use these status codes properly, you won’t be able to distinguish between:

  • successes and errors
  • errors that are the server’s responsibility and errors that are the client’s responsibility.

These distinctions are dead-simple to make and massively important for determining whether errors are occurring, and if so, where to start debugging.

If you’re in the context of a request and you know to expect exceptions or error events from a specific piece of code, but don’t know/care exactly what type of exception, you’re probably going to want to log it with console.error() and respond with a 500, indicating to the user that there’s a problem with your service. Here are a couple of common scenarios:

  • the database is down, and this request needs it
  • some unexpected error is caught
  • some api for sending email couldn’t connect to the smtp service
  • etc, etc

These are all legitimate 500 scenarios that tell the user “hey the problem is on our end, and there’s no problem with your request. You may be able to retry the exact same request later”. A number of the “unexpected errors” that you might catch though will indicate that your user actually did send some sort of incorrect request. In that case, you want to respond with a reasonable 4xx error instead (often just a 400 for “Bad Request”) that tells them what they did wrong.

Either way, you generally don't want 500s at all. I get rid of them by fixing the issues that caused them, or by turning them into 4xxs where appropriate (e.g. when bad user input is causing the 500), to tell the client that the problem is on their end. The only time I don't try to change a response from a 500 is when it really is some kind of internal server error (like the database being down), and not some programming error on my part or a bad client request.

Detection: 500s are important enough that you’re going to want to have a unified solution for them. There should be some simple function that you can call in all cases where you plan to respond with a 500 that will log your 500, the error that caused it, and the .stack property of that error, as well as incrementing a metric, like so:

var internalServerError = function(err, req, res, body){
    console.log("500 ", req.method, req.url);
    console.error("Internal Server Error", err, err.stack);
    metrics.increment("internalServerError");
    res.statusCode = 500;
    res.end(body);
};

Having a unified function you can call like this (possibly monkey-patched onto the response object by a middleware, for convenience) gives you the ability to change how 500’s are logged and tracked everywhere, which is good, because you’ll probably want to tweak it fairly often.
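If you go the monkey-patched-onto-the-response route mentioned above, the middleware might be sketched like this:

app.use(function(req, res, next){
  res.internalServerError = function(err, body){
    console.log("500 ", req.method, req.url);
    console.error("Internal Server Error", err, err.stack);
    metrics.increment("internalServerError");
    res.statusCode = 500;
    res.end(body || "Internal Server Error");
  };
  next();
});

// later, inside a handler:
//   if (err) return res.internalServerError(err);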

You're probably actually going to use this on most of the asynchronous calls in your HTTP-handling code (controllers?) where you don't expect an error. Here's an example:

function(req, res){
    db.user.findById(someId, function(err, user){
        if (err) return internalServerError(err, req, res, "Internal Server Error");
        // ...
        // and normal stuff with the user object goes here
        // ...
    });
}

In this case, I just expect to get a user back, and not get any errors (like the database is disconnected, or something), so I just put the 500 handler in the case that an err object is passed, and go back to my happy-path logic. If 500s start to show up there at runtime, I’ll be able to decide if they should be converted to a 4xx error, or fixed.

For example, if I start seeing errors where err.message is "User Not Found", as if the client requested a non-existent user, I might add 404 handling like so:

function(req, res){
    db.user.findById(someId, function(err, user){
        if (err) {
            if (err.message === "User Not Found"){
                res.statusCode = 404;
                return res.end("User not found!");
            }
            return internalServerError(err, req, res, "Internal Server Error");
        }
        // ...
        // and normal stuff with the user object goes here
        // ...
    });
}

Conversely, if I start seeing errors where err.message is “Database connection lost”, which is a valid 500 scenario, I might not add any specific handling for that scenario. Instead, I’d start looking into solving how the database connection is getting lost.

If you’re building a JSON API, I’ve got a great middleware for unified error-handling (and status codes in general) called json-status.

A unified error handler like this also leaves you with the ability to expand all internal server error handling later, when you get additional ideas. For example, we've also added the ability for it to log the requesting user's information, if the user is logged in.

Logged Errors

I often make liberal use of console.error() to log errors and their .stack properties when debugging restarts and 500s, or just reporting different errors that shouldn’t necessarily have impact on the http response code (like errors in fire-and-forget function calls where we don’t care about the result enough to call the request a failure).

Detection: I ended up adding a method to console called console.flog() (You can name the method whatever you want, instead of ‘flog’ of course! I’m just weird that way.) that acts just like console.error() (ultimately by calling it with the same arguments), but also increments a “logged error” metric like so:

console.flog = function(){
  if (metrics.increment){
    metrics.increment("loggedErrors");
    // assume `metrics` is required earlier
  }

  var args = Array.prototype.slice.call(arguments);
  if (process.env.NODE_ENV !== "testing"){
    args.unshift("LOGGED ERROR:");
    // put a prefix on error logs to make them easier to search for
  }

  console.error.apply(console, args);
};

With this in place, you can convert all your console.error()s to console.flog()s and your metrics will be able to show you when logged errors are increasing.

It’s nice to have it on console because it sort of makes sense there and console is available everywhere without being specifically require()ed. I’m normally against this kind of flagrant monkey-patching, but it’s really just too convenient in this case.

Log Levels

I should note too that I don't use traditional log levels (error/debug/info/trace) anymore, because I don't find them all that useful. Everything I log is either an error or it isn't, and I generally strive to keep the logs free of everything other than lines that indicate the requests being made, like so:

  ...
  POST /api/session 201
  GET /api/status 200
  POST /api/passwordReset 204
  GET /api/admin.php 404
  ...

That doesn’t mean that I don’t sometimes need debug output, but it’s so hard to know what debug output I need in advance that I just add it as needed and redeploy. I’ve just never found other log-levels to be useful.

More About Metrics

All of the efforts above have been to make defects self-reporting and visible. Like I said in the first part of this series, the greatest supplement to logging that I’ve found is the time-series metrics graph. Here’s how it actually looks (on a live production system) in practice for the different types of defects that have been discussed here.

The way to make metrics most effective is to keep them on a large video display in the room where everyone is working. It might seem like this shouldn’t make a difference if everyone can just go to their own web browser to see the dashboard whenever they want, but it absolutely makes a huge difference. Removing that minor impediment and having metrics always available at a glance results in people checking them far more often.

Without metrics like this, and a reasonable aggregated logging solution, you are really flying blind, and won’t have any idea what’s going on in your system from one deployment to the next. I personally can’t imagine going back to my old ways that didn’t have these types of instrumentation.

I’ve been working on production-quality node.js web applications for a couple of years now, and I thought it’d be worth writing down some of the more interesting tricks that I’ve learned along the way.

I’m mostly going to talk about maintaining a low-defect rate and high availability, rather than get into the details about scaling that are covered in a lot of other places. In particular, I’ll be talking about load-balancing, process management, logging, and metrics, and the how’s and why’s of each.

Balance the Load

I’m going to assume that you’re already load-balancing on a given server with cluster or some higher-level abstraction (I use cluster-master), as well as between servers with a load-balancer like ha-proxy (there’s a minimal sketch of the per-machine part after the list below).

Performance considerations aside, your service will have much better availability, quality, and uptime if you’ve got multiple processes running on multiple machines. More specifically, you get:

  • the ability to immediately failover in the case of single process or single machine failure
  • reduced overall service failure in the case of single process or single machine failure
  • the ability to gracefully deploy with the load-balancer
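If you’re not already doing the per-machine part, here’s a minimal sketch using node’s built-in cluster module (cluster-master layers graceful shutdown and other niceties on top of this same basic idea):

var cluster = require("cluster");
var http = require("http");
var os = require("os");

if (cluster.isMaster){
  // fork one worker per cpu
  os.cpus().forEach(function(){
    cluster.fork();
  });

  // replace any worker that dies, so a single process failure doesn't take the service down
  cluster.on("exit", function(worker){
    console.error("worker " + worker.id + " died; forking a replacement");
    cluster.fork();
  });
} else {
  // each worker runs its own copy of the server on the shared port
  http.createServer(function(req, res){
    res.end("ok\n");
  }).listen(8000);
}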

Gracefully Deploy

Gracefully deploying means that your deploy process has enough servers running to handle the load at all times throughout the actual deployment process, and that none of the servers are taken off-line while outstanding requests are still in progress. This means:

  • Your clustering solution must be able to stop accepting new connections and not exit the master process until all workers have finished processing their existing requests (there’s a rough sketch of this draining behaviour just below). The solution I personally use is cluster-master, but there are a bunch of suitable alternatives.

  • You need a rolling deployment, meaning that servers are restarted one-at-a-time, or in small groups, rather than all at once. The easiest way to do this is probably to write a nice deploy script that takes restarting servers out of the load-balancer until they’re running again.

If you don’t have a graceful deploy solution, every deployment of new code will lose requests in progress, and your users will have a terrible experience.
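To illustrate the first point, here’s a minimal sketch of the draining behaviour you want in each worker when it’s told to shut down, assuming `server` is that worker’s http server and SIGTERM is your shutdown signal (cluster-master handles this kind of thing for you):

process.on("SIGTERM", function(){
  // stop accepting new connections; the callback fires once all
  // in-flight requests have finished
  server.close(function(){
    process.exit(0);
  });

  // failsafe: don't wait forever on requests that never finish
  setTimeout(function(){
    console.error("forcing shutdown with requests still in flight");
    process.exit(1);
  }, 30 * 1000);
});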

Also note: I’ve seen a few clustering solutions that use some sort of hot-deploy (hot-deploy loads a new version of code without taking the server down) functionality. If you’ve got a rolling deploy via your load balancer though, you probably don’t need any sort of hot-deploy functionality. I’d personally avoid solutions that involve the complexity of hot-deploying.

Run as a Service

You’re also going to want to be running your app as a service that the OS knows to restart, with some service manager like upstart. A service manager like this is going to be absolutely essential for when your node.js app crashes, or when you spin up new machines.
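As a rough sketch (all of the names and paths here are placeholders), an upstart config for that might look something like this:

# /etc/init/myserver.conf
description "my node.js web service"

start on runlevel [2345]
stop on runlevel [016]

# restart the process if it ever crashes
respawn

# the log redirection here is covered in the next section
exec node /opt/myserver/server.js >> /var/log/myserver.log 2>&1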

It’s probably worth noting that you won’t really want to use something like forever or nodemon in production, because they don’t survive reboots, and they’re pretty redundant once you’ve added service management that actually does. (This is a case where you don’t want redundancy, because these types of process managers can end up fighting with each other to restart the app, thus never really allowing it to start.)

Log to Standard Output

Logging to standard output (using console.log()) and standard error (using console.error()) is the simplest and most powerful way to log. Here’s how:

pipe it, don’t write it

In the config file for running your app, you want something like this to specify a log file:

node server.js >> /var/log/myserver.log 2>&1

The >> tells the shell to append your node process’s output to the specified log file, and the 2>&1 tells it that standard error should go to that same log file as standard out. You don’t want to be writing to the logs programmatically from within the node process, because you’ll miss any output that you don’t specifically log, like the standard error output from node.js itself any time your server crashes. That kind of information is too critical to miss.

console is the only “logging library” you need

With this kind of logging set up, I just have to console.log() for any debug output that I need (usually just temporarily for a specific issue that I’m trying to solve), or console.error() for any errors I encounter.

Additionally, one of the first things I do on a new web service is to set up a console.log() for each request (this should work in any express or connect app):

  app.use(function(req, res, next){
    // log the method, url, and status code once the response has been sent
    res.on('finish', function(){
      console.log(req.method, req.url, res.statusCode);
    });
    next(); // don't forget to pass the request along to the rest of the app
  });

This chunk of code gives me nice simple logs for every request that look like this:

...
POST /api/session 400
POST /api/session 401
POST /api/session 200
GET /api/status 200
GET /api/status 200
...

rotating the logs

The missing infrastructure needed to support this is a way to rotate the logs, like logrotate. Once that’s set up properly, your logs will rotate for you nicely and not fill up your disk on you.
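A minimal logrotate sketch for the log file above might look something like this (the path and retention settings are just placeholders):

# /etc/logrotate.d/myserver
/var/log/myserver.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    # copytruncate matters here: the node process keeps its file handle open,
    # so logrotate copies and truncates the file instead of moving it
    copytruncate
}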

Tools to Help Detect Problems at Runtime

There are two basic key ways that I like to instrument an application to detect problems that occur at runtime: Aggregated logging and metrics.

Aggregated Logging

One of the most important things you can do for error and defect detection is to have aggregated logging — some service that brings all of your web servers’ logs together into one large searchable log. There are a few products for this: the stand-out open-source option for me was the logstash/kibana combination, though these days I’m lazy and generally just use the papertrail service instead.

I would highly recommend that you set a service like this up immediately, because the small amount of friction involved in sshing into servers to tail logs is enough to seriously reduce how often you and your teammates will actually look at the logs. The sooner you set this up, the sooner you can benefit from being able to learn about your application through the logs that you have it write.

Metrics

When I say “metrics” I really mean time-series metrics, that allow me to see the frequency of different types of events over time. These are invaluable because they

  • tell you when something unusual is happening
  • aggregate certain types of data in ways that logs can’t
  • help you rally the team or company around specific high-value goals (be careful with this one!)

The stand-out open-source metrics/graphing product is probably graphite. I’ve generally been using Librato’s metrics service though, because it’s easy to set up and looks great, so that’s where I’ll pull my screenshots from for time-series data. I’ve also had a pretty good experience with DataDog’s service. Both come with the ability to raise alerts when a metric surpasses a threshold, which can be a nice way to know when something is going on that you should investigate.

Basic Metrics

There are a bunch of basic metrics that you can track to see what’s going on in your server.

Here’s an example of some very high-level metrics for us over a week:

  • in blue: number of requests
  • in green: average duration of requests

(Note that at this resolution, the y-values are averages over such long durations that the values aren’t really that useful at face value; it’s the visualization of general trends that is useful.)

There are a few obvious things that should jump out right away in this example:

  • We have daily traffic spikes with 5 peaks really standing out and 2 being almost flat (those happen to be Saturday and Sunday).
  • We had a spike in average request duration (on Friday morning — the third peak). This was caused by some performance issues with our database, and resulted in service outage for a number of our users during that time.

I can basically put metrics.increment("someEventName"); anywhere in my codebase to tell my metrics service when a particular event occurred.

Also consider OS metrics like:

  • disk space usage
  • cpu utilization
  • memory consumption
  • etc, etc

I’ve got my codebase set up so that metrics.gauge("someMetricName", value); will allow me to graph specific values like these over time as well.
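For what it’s worth, that metrics object is just a thin wrapper of my own around a metrics client, so the rest of the codebase never cares which service is behind it. Here’s a minimal sketch, using the statsd-style lynx client as one example (the host, port, and module layout are assumptions):

// metrics.js -- a thin wrapper so the rest of the app just calls
// metrics.increment() and metrics.gauge()
var Lynx = require("lynx");

var client = new Lynx("localhost", 8125); // wherever your statsd-compatible collector lives

module.exports = {
  increment: function(name){
    client.increment(name);
  },
  gauge: function(name, value){
    client.gauge(name, value);
  }
};

OS-level gauges can then be reported on an interval from anywhere in the app, e.g. calling metrics.gauge("freeMemoryBytes", require("os").freemem()) every few seconds.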

If you’re not already doing monitoring like this, you should be wondering what your servers would tell you if you were. When it’s this easy, you tend to do it all the time and end up with all kinds of interesting metrics, including more business-specific ones.

What next?

These are really just the basics. If you’re not doing these things, and you care about running a production-quality web service, you’re probably putting yourself at a disadvantage.

There is a lot more that you can do though, and I’ll get into that in my next post, which will be Part II to this one.

I’ve been reading the 37 signals book Remote lately and it’s got me thinking a lot about the advantages and disadvantages of working remotely. While they clearly acknowledge that there are trade-offs to working remotely, I personally don’t think they paint a very clear picture of just how expensive those trade-offs are. I can only really speak from a software development perspective, but I think this might apply to other types of businesses where a high degree of collaboration is necessary and the concept of “team” is more than just “a group of people”.

After 7 or 8 years of being a home-based contract-worker, and close to the same time spent working in various office environments, I’ve found that with the right people and the right environment, the office can be a lot more productive (and actually enjoyable) than a home environment, for a number of reasons.

Skype / Google Hangouts / Webex pale in comparison to actually being in the same room as someone else. I can’t explain why, because it seems to me like it should be good enough (and I really wish it were sometimes), but the fact of the matter is that it’s not. You miss nuances in communication that can add up to big differences in productivity. I’ve been in hours and hours of pair programming sessions over webex: it doesn’t match the high-bandwidth of shoulder-to-shoulder pairing. I’ve done the shared whiteboard thing: it doesn’t match the speed and simplicity of an actual whiteboard.

With communication over today’s technological mediums, there’s also a natural tendency to end the interaction as soon as possible and “get back to work”; people don’t actually leave Skype / Google Hangouts / WebEx running all day long. The result is that people are reluctant to start new conversations like that because they’re worried about interrupting the other person. So people naturally fall back on less intrusive means of communication (text chat), and often even further back to more asynchronous forms (email).

And asynchronous communication is not the boon to productivity that people think it is. Email is today’s definitive means of asynchronous communication, but I think it’s pretty obvious that there are few methods of communication that are less efficient. Imagine a sports team that can only communicate by email. That would be ridiculous. Email is amazing at its ability to cut down on interruptions of course, but it’s at the obvious cost of timeliness and immediacy (not to mention the very real human aspects including general tone and emotional nuances!).

The ideal solution would be to solve the problem of interruptions without losing timeliness.

Explaining this requires actually defining an “interruption”: when you’re on a team and someone wants to talk to you about the main goal of the team, that’s decidedly NOT an interruption. Imagine a quarterback being annoyed by the “interruption” of someone yelling that they’re open for a pass. Imagine a soldier in a firefight being annoyed at another soldier for requesting cover fire prior to some forward manoeuvring. For something to really be an interruption, you have to be working on a different priority than the other person, one that you think is higher.

In a collaborative environment, interruptions should often be viewed as a symptom of the problem, not the actual cause. To solve the problem of interruptions at the root, you’ve got to clearly define the priorities of the team, align everyone on them, and concentrate as many people as possible on the top goal, rather than going the easy route (management-wise) and doing something like giving each of the X members of the team one of the top X priorities, effectively “team-multi-tasking”. Team-multi-tasking is a common approach because:

  • It’s the easiest thing to do management-wise (Why bother prioritizing when I can have X tasks worked on at once?).

  • It feels like you’re getting more done, because so many things are in progress at once.

  • It ensures everyone is 100% utilized.

But it’s also pretty obviously the absolute slowest way to get your top priority shipped and to start getting value from it (and often the slowest way to get them all done, believe it or not!).

Not only that, but the more you do this, the more people tend to specialize into roles like “the back-end guy” or “the ops guy”, etc., and individuals lose the ability to be cross-functional and to practice collective code ownership. It’s a vicious cycle too: the more an individual tends toward deep specialization, the more we’re tempted to give them that kind of work because they’re the best at it. Not only does your bus-factor risk skyrocket, but you get back into scenarios where any time someone engages someone else for help, it’s an interruption to that other person, so people tend not to ask for help when they really should, or they use dog-slow asynchronous methods (like email). Breaking this cycle means constantly trying to break down these specialty silos by having more experienced people collaborate with (and often mentor) less experienced people. The end result is a team that makes full-stack engineers, not just one that hires them.

I find that the right mindset for the team is to create an environment that would be best for a team working on only 1 super-important thing that you want to start getting value from as soon as possible. A general rule of thumb is: If you’re having a text conversation with someone in the same room, you’re doing it wrong. I know that’s common, but if you think about it, it’s pretty absurd.

How would that environment look? It would be a “warroom” for radically collocating the team. Everyone that needs to be there would be there, arranged in a manner that’s most effective for verbal communication. There would be nearby whiteboards for collaborative design sessions, and information radiators to show the progress of the efforts and various other metrics of the current state of affairs by just glancing up.

How would the team behave? They would be a “tiger team”. They would all be helping to get that one thing out the door. They would almost never use any electronic means to communicate (except obviously, when it’s more efficient, like sharing chunks of code, or urls), and you’d never hear someone say “that’s not my job”. If someone is the only person that knows how to do something, the team identifies them as a process choke-point and quickly tries to get others up to speed on that skill or responsibility (and not by giving them links to docs or occasional hints, but by super-effective methods like pair programming, in-person demonstration, and actual verbal discussion). If one member of the team appears to be stuck, he asks for help, and if he doesn’t, the other members notice, and jump in to help, unprompted. There are no heroes, and everyone takes responsibility for everything that the team is tasked with. This can and should be taken to extremes too: members should drop their egos and make themselves available to do grunt work or testing for other members — whatever gets the top priority shipped as soon as reasonably (and sustainably) possible. This includes making yourself vulnerable enough to ask for help from others, not only for tricky aspects of your work, but also just to divide your work up better. If you’re taking longer than expected on a given task, you should be conscious enough of that to be openly discussing it with the team, including possible solutions, and not trying to be a hero or a martyr. Conversely, you should never be waiting for extended periods of time for your teammate to do some work that you depend on for something else. If a teammate is taking longer than usual, jump in and help.

How would management work then? The team should self-organize. With clear priorities set, the team can and should, for the most part, self-direct. A manager should not try to manage the avalanche of interactions that happen during free collaboration throughout the day. Trying to manage who works on what will simply make that manager a choke-point and slow the process down (and he/she will inevitably be wrong more than right, compared to the wisdom of the entire team). Often a non-technical manager can have an advantage over more technical managers, because he’s forced to trust the team’s decisions and doesn’t become an “approval choke-point” for the team’s engineering decisions.

That doesn’t mean a management role is unnecessary though; it’s actually quite demanding to manage a team like this. Priorities must always be crystal clear, and for the most part the top few priorities being worked on should not change very often. If they do change often, you usually have either a management problem or a product-quality problem. Those problems should be fixed at the root as well, rather than making developers feel like management or customers are just a constant stream of interruptions. Keep in mind that if you change the top priority often enough, you can completely prevent the team from making any progress at all (I’ve spent months on a team like that before — it’s not fun for anyone). (For a more complete description of the ideal responsibilities for this style of management, see “The Management” here.)

Does this work for all types of people? No. But then again, you’ll never get a fast team of developers without hiring the right people. If you have people on the team that are incapable of being team-players, it’s certainly not going to work (and you’ll likely have a number of problems).

What if we have TWO top priorities though? Flip a coin and pick one; it’s that simple. Just arbitrarily choosing one will get one of them shipped and returning value as soon as possible.

What about team member X that has nothing to do? The team should recognize this and:

  • try to break down the tasks better so they can help (it’s often worth a 30-minute session to bring other people on board — any given task can almost always be broken down further with additional design discussion, even if it’s not worth dividing up the work).

  • try to get them pair-programming so they can learn to help better.

  • try to get them testing the parts that are done.

Obviously in some scenarios it might make sense for them to start on the second priority, but they should be ready to drop that work at any time to help get priority #1 out the door faster. Remember: the goal is to ship finished features faster. It’s not to keep the people on the team as busy as possible.

Isn’t it really tough on the developers to rarely have quiet time to concentrate? Yes, at first it often is: focusing on the top priority takes discipline because it’s different from the status quo, and people simply aren’t used to the buzz in the room. The longer you spend working with headphones on, trying to carve out your own niche separate from the rest of the team, the longer that transition will take and the harder it will be. But soon you learn to ignore the irrelevant buzz in the room and to tune back in quickly when it’s important and relevant. And since you’re all working on the same thing, it’s often relevant, and it’s a lot less difficult to get back into the flow state than you’d expect (especially if pair programming). So the pay-off for developers is:

  • huge productivity improvements

  • less waiting for someone else to do their part

  • longer sessions in “flow state” and easier returns to flow state

  • less solitude without adding interruptions

  • fewer “all the weight is on my shoulders” scenarios

  • more knowledge sharing and personal development

It helps a lot if developers take breaks more often, because you actually end up spending a lot more time in the flow state in a given day. You end up learning more from other developers and being vastly more productive as well, which is exhausting and energizing at the same time. In my opinion, when done right, it’s just a lot more fun than isolated development.

Of course, it also helps to have break-out rooms for smaller prolonged conversations when desired.

But what if you have dozens of engineers? It’s just not reasonable to try to make teams that contain dozens of engineers. Brooks’ law has seemed to hold fairly well over time and he explains that one reason is that the more people you add to a group, the more communication overhead there is. I’ve personally seen this as well, with team-size starting to get diminishing returns between 8 and 12 people, with little value (if any) to adding members beyond that. When you get beyond 8 people on a given team, you need to start thinking about creating a second team. Conway’s Law seems to dictate that a service-oriented architecture is the best solution we’re going to get for scaling up an organization’s developer head-count, but I’ve personally found that smaller codebases with distinct responsibilities are generally better codebases anyway.

With all that said, of course I really wish I could work remotely sometimes and get all the same benefits. I’ve simply never seen a telepresence technology set-up that matches the fidelity of actual face-to-face and shoulder-to-shoulder collaboration. Maybe it exists. I’d love to hear about it. But my experience simply doesn’t indicate that the current status quo technologies (skype, webex, etc), like Remote recommends, are up to snuff. The way it stands today, I’ll almost always put my money on a collocated group of “team-players” over a geographically disparate team of “gurus”.

So you’re thinking about versioning your web API somehow. Versioning is how we annotate and manage changes in software libraries, so it seems natural to re-apply this solution to changes in web APIs. Most web APIs are versioned in order to…

( A ) “freeze” an older version somehow so that it doesn’t change for older clients, or at least it is subjected to much less change than newer versions, so you can still move forward.

( B ) be able to deprecate older APIs that you no longer wish to support.

( C ) indicate that changes have occurred and represent a large batch of changes at once when communicating with others.

However…

( A ) Versions don’t actually solve the problem of moving forward despite older clients.

You can safely break backward compatibility with or without the concept of versions: just create a new endpoint and do what you want there. It doesn’t need to have any impact on the old endpoint at all. You don’t need the concept of “versions” to do that.

( B ) Versions don’t actually solve the problem of deprecating old endpoints.

At best, versions just defer the problem, while forcing you to pay the price of maintaining two sets of endpoints until you actually work out a real change management strategy. If you have an endpoint that you want to retire, you’re going to have to communicate that to all your clients that you’re deprecating it, regardless of whether or not you’ve “versioned” it.

( C ) Versions DO help you communicate about changes in a bunch of endpoints at once. But you shouldn’t do this anyway.

Of course, if you’re retiring 20 endpoints at once, it’s easier to say “we’re retiring version 1” than to say “we’re retiring the following endpoints…” (and then list all 20), but why would you want to drop 20 breaking-change notices on your clients all at once? Tell your clients as breaking changes come up, and tell them what sort of grace period they get on those changes. This gives you fine-grained control over your entire API, and you don’t need to think about a new version for the entire API every time you have a breaking change. Your users can deal with changes as they occur, and not be surprised with many changes at once (your new version).

“So how the @#$% do I manage change in my API??”

With a little imagination, it’s totally possible. Facebook is one of the largest API providers in the world, and it doesn’t version its API.

Take preventative measures: Spend more time on design, prototyping, and beta-testing your API before you release it. If you just throw together the quickest thing that could possibly work, you’re going to be paying the price for that for a long time to come. An API is a user interface, and as such it needs to actually be carefully considered.

Consider NOT making non-essential backwards-incompatible changes: If there are changes you’d like to make in order to make your API more aesthetically-pleasing, and you can’t do so in an additive fashion, try to resist those changes entirely. A good public-facing API should value backward compatibility very highly.

Make additive changes when possible: If you want to add a new property to some resource, just add it. Old clients will just ignore it. If you want to make breaking changes to some resource, consider just creating a new resource.

Use sensible defaults where possible: If you want a client to send a new property, make it optional, and give it a sensible default.
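For example (a hypothetical express endpoint with a made-up listUsers() helper), a new optional pageSize parameter with a sensible default doesn’t break anyone:

app.get("/api/users", function(req, res){
  // old clients that never send pageSize keep getting the old behaviour
  var pageSize = parseInt(req.query.pageSize, 10) || 25;

  listUsers(pageSize, function(err, users){
    if (err){
      res.statusCode = 500;
      return res.end("Internal Server Error");
    }
    res.json({ pageSize: pageSize, users: users });
  });
});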

Keep a changelog, roadmap and change-schedule for your API: If you want to solve the problem of keeping your users abreast of changes in your API, you need to publish those changes somehow.

Use hypermedia: Tell the clients what uris to use for the requests they need to make, and give them uri templates instead of forcing them to construct their own uris. This will make your clients much more resilient to change because it frees you up to change uris however you want. Clients constructing their own uris are tomorrow’s version of parsing json with regular expressions.
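As a rough sketch of what that can look like in a response (the HAL-style _links convention is just one option, and the resource is hypothetical):

app.get("/api/users/:id", function(req, res){
  // the client never constructs these uris itself; it just follows them,
  // and the {?page} uri template tells it how to page through orders
  res.json({
    id: req.params.id,
    name: "...",
    _links: {
      self:   { href: "/api/users/" + req.params.id },
      orders: { href: "/api/users/" + req.params.id + "/orders{?page}", templated: true }
    }
  });
});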

NB!

Notice that almost all of these are things that you should do regardless of whether or not you version your API, so they’re not really even extra work.

Notice also that I never said “Never make changes”. I’m not trying to make an argument against changes. I’m just saying that versioning doesn’t help you make changes at all. At best it just lets you put a bunch of changes in a “bucket” to dump on your API users all at once.

If you’ve ever seen a laundry list of the types of non-functional requirements like testability, accessibility, reliability, etc, (often called “ilities” because of their common suffix) you probably haven’t seen “rewritability”, but I really want to add it to the pile.

I’ve often found in software engineering that, counterintuitively, if something is difficult or scary there’s a lot to be gained by doing it more often. I think we can probably agree that software rewrites are among those difficult and scary tasks.

Now that we’re starting to get better at service-oriented architecture, where networked applications are better divided into disparate services that individually have narrow responsibilities, a natural side effect is that it’s getting easier to write services that are replaceable later.

So what makes a piece of software rewritable?

Clearly defined requirements: not just what it needs to do, but also what it won’t do, so that feature-creep creeps into another more appropriate service. This one ends up requiring some discipline when the best place and the easiest place for a given responsibility don’t end up being the same service, but it’s almost always worth it.

Clearly defined interface: If you want to be able to just drop a new replacement in place of a previous one, it’s got to support all the integration points of the previous one. When interfaces are large and complicated this gets a lot harder and a lot riskier. In my opinion, this makes a lot of applications with GUIs very difficult to rewrite from scratch without anyone noticing (Maybe it’s because UX is hard. Maybe that’s because I’m terrible at GUIs.).

Small: Even with clearly defined requirements and interface, a rewrite is still going to be expensive and hard to estimate if there’s a lot to rewrite.

Any software engineers that have found themselves locked into a monolithic ball of mud application can see the obvious upsides of having a viable escape route though.

And once we’re not scared of rewrites, what new possibilities emerge? Real experimentation? Throw-away prototypes that we actually get to throw away?

How much easier is it to reason about a piece of software when it also meets all the criteria that make it easy to rewrite?

I once asked an interviewee “Ever work on a high-performing team?” and he answered “Yeah…” and then went on to explain how his team profiled and optimized their code-base all the time to give it very high performance.

That’s not what I meant, but in his defence, I really don’t think many developers have had the opportunity to work on high-performing teams and so they don’t know what they don’t know.

I’ve been developing software professionally for around 15 years and I’ve only worked on a high-performing team once. It was a couple of years ago, and it was awesome. I haven’t been able to find another since, but now I’m a true-believer in the concept of a 10X team.

I suspect that there might be groups out there somewhere who have experience with even higher-performing teams, but I’ve been on enough teams now to be certain that high-performing teams are extremely few and far between. I’m not sure I even know how to create one, but I know one when I see one. I’ve definitely been on teams since then that have come close to being high-performing.

The high-performing team is the ultimate win-win in the software development industry because the developers on the team are thrilled to be going into work on a daily basis and the business is thrilled because they’re producing results much faster and more reliably than other teams. Those are really the two essential criteria, in my opinion:

  1. Is everybody happy to be part of the team?
  2. Is the team producing results quickly?

The first criterion is absolutely necessary for the second to be sustainable. Any business that neglects developer enjoyment is taking a short-sighted path that will eventually negatively impact performance (due to burnout, employee turnover, etc).

So all of this sounds great, but the real mystery is in how to make it happen. I’m not 100% sure that I know. I don’t think I’ve ever even heard of people talking about how to achieve it. This is my shot at describing the recipe though.

The Management

The craziest part of my experience is that the only time I’ve seen a high-performing team happen was when the manager of the team had very little software development experience. That seems maddeningly counter-intuitive to me, because from a developer’s perspective, one of the most frustrating aspects of software development is having a clueless person telling you what to do and how to do it. It turns out that being clueless about software development can be a huge asset in software development management though, if the manager is already brilliant at management in general. This manager was definitely brilliant on the management side. Here’s why:

  • He hired for attitude over aptitude. He would hire based on people’s potential, their attitudes, and their willingness to work hard, instead of whether their precise skillset filled a niche that we needed.
  • He would cull developers who weren’t team players. It’s not a job anyone wants to do, but it’s sometimes necessary to let people go when they’re not working out. Not doing this results in the rest of the team being miserable.
  • He kept the team small, and didn’t move people in or out of it very often. This gives the team the maximum opportunity to form into a cohesive unit.
  • He would set clear goals and priorities. No two things would share the same priority (even though I’m sure he was probably secretly flipping a coin when more than one thing was extremely important). That would give the team a clear direction and an ability to focus on what’s most important.
  • He never explained a requirement with “because I said so”. This encouraged a culture of questions and feedback which enhanced learning both on the manager’s part and on the team’s part.
  • Even when priorities would change, he’d let us complete our current work where possible, so that we wouldn’t have a mess of incomplete work lying around to try to come back to later. I’ve been on teams since where the team has been repeatedly interrupted to the point of not really delivering anything for months.
  • He assigned work to the team, and not to individuals, so the team would have to own it collectively. This kept the manager from being the choke-point for “what should I work on next?” questions.
  • He didn’t make decisions or solve problems for the team that the team was able to handle themselves (even when individuals on the team would ask).
  • He’d bring us the requirements, but leave the solutioning up to us. The result was that the team was largely autonomous and in control of how it accomplished its goals, so we immediately had a newfound sense of responsibility for our work and even pride in it.
  • He’d get estimates from us, and not the reverse (i.e. deadlines). This way he could inform the business about the estimated cost and schedule for a given feature so they could make informed decisions about which features to pursue, rather than just delivering whatever was finished when the clock ran out.
  • He’d actively look for scenarios where the team was not “set up for success” and rectify them. Nothing is worse for team morale than a death march project, even a mini-one.
  • He held the team responsible for the performance of individuals. If someone on the team was struggling on a task (daily stand-up meetings make this pretty obvious) without any help from the team, he’d ask the team what they were doing to help.
All of these things created the perfect climate for a high-performing team, but no one person (even a great manager) can create a high-performing team on their own.

The Team Member

The individuals on the team were also really impressive to me and no doubt absolutely essential to making the team what it was. Here’s a loose laundry-list of the qualities I witnessed and tried to adhere to myself:

  • Members of the team were open, honest, and respectful. They would give each other feedback respectfully and truthfully and would never talk behind each others’ backs. On other teams that I’ve worked on where this wasn’t the practice, team-crushing sub-cliques would form.
  • Once a decision was made by the team, every member of the team would support it (even if he/she was originally opposed) until new evidence was presented. This worked well because we would always first take the time to actually listen to and consider everyone’s opinions and ideas.
  • Members of the team were results-oriented. “Effort” was never involved in our measure of progress. Software actually delivered according to the team’s standards was our only measure of progress.
  • The team would continuously evaluate itself and its practices and try to continuously improve them. We’d have an hour-long meeting every week to talk about how we were doing, and how we could improve (and how we were doing implementing the previous week’s improvements).
  • The team would practice collective code ownership, and collective responsibility over all the team’s responsibilities. This is actually a pretty huge deal; if people working on a top priority needed help, they’d ask for it immediately and anyone that could help would. If someone was sick, someone else would take on their current responsibilities automatically. If a build was broken, everyone would drop everything until there was a clear plan to fix it. The morning stand-up meeting would happen at exactly 9 am regardless of whether the manager was there.
  • The team would foster cross-functionality as a means of achieving collective code ownership and collective responsibility. What that meant was that members would actively try to teach each other about any particular unique skill-sets they had so that they would be redundant and team members would be better able to take on any task in the work queue.
  • The team would be as collaborative as possible. We all sat in a bullpen-like environment where we’d torn down the dividers that were originally erected between us. We’d often pair-program, not only for skill-sharing and mentoring, but also for normal coding tasks, because the boring tasks are less mind-numbing that way and the complicated tasks get better solutions that way.
  • In the 5 or 6 teams in our organization at the time, I think our team was by far the fastest, and most capable of handling the more technically diverse and challenging work. No one on the team was a rock-star programmer, but somehow despite that, the team as a whole was really outstanding.

Eventually, when this manager moved on, I took the management role and ultimately failed to maintain that environment (partially for a number of reasons that were out of my control, and partially because at the time I didn’t recognize the necessary ingredients). I’ve only really learned them since then by watching dysfunctional teams.

It’s a cliche, but in the right environment, the collective can indeed be greater than the sum of its parts.

EDIT (June 2nd, 2014):

I just moved my blog over from wordpress, but I don’t want to lose the comments that I had there, because they’re great and they’re from the team that I was writing about. I’m just going to copy/paste them in here for posterity.


Todd said on May 8, 2013 at 8:12 am:

You nailed it! I’m glad my simple ‘I miss working with you…’ email triggered you to write this thoughtful post. I admire your choice to spend time sharing your thoughts with the world.

I was trying to come up with some clever addition to your post but I keep typing and erasing. The one constant thought I whole heatedly believe had the most impact on our teams success is ‘Jaret’. I sure hope he doesn’t read this post, they’ll be no living with him! :)


Tom said on May 10, 2013 at 7:50 pm:

Gregg and Todd, I also miss working with you guys. I do, however, have some pertinent hymns to add to the post. Though, I must first agree that a highly productive team is not formed from a cookie-cutter, but many ingredients do add to the potential success; kudos to Gregg for identifying most.

Management: I’d add that the “ideal” manager is an advocate for the team’s decisions, approach and methodology even if he does not fully understand them. Specifically, if the team can convince, through reason, that they were not smoking crack and that their solution would be suitable and economical (perhaps not in the short-run, but in the long-run), then he would go to bat for them. He’d do this no matter how much counter-pressure was exerted by upper management; clearly JW did this.

Secondly, a culture of experimentation, collaboration, and trust must trickle down from the upper echelons of management down to the lowest levels. Think of this as an approach to lowering entropy. Without this, all middle managers have their hands tied.

Team: Diversity in cultural and technological background is highly advantageous. The simple reason is that varied backgrounds allow differing opinions, analysis and approaches to a solution. I’d be bold enough to state that a homogeneous team is useless at being highly-productive because there is not enough diversity in the “knowledge gene pool” to forge great solutions. On the flip side, too much diversity may be a hindrance as agreement may be hard to come by for the final solution, but somebody should play devils advocate during the initial discussions; this only comes from varied experiences.

As well, team members should feel as if they are creating something of value. Perhaps this is more of a company or management issue, but I believe if people, regardless of their position in a company, feel like they are contributing to a greater good will naturally help their teammates and this will improve the productivity of the team as well as the firm’s bottom line.

I’m sure there are more comments to add, but I need to go have another glass of wine and reminisce :)


Gregg Caines said on May 11, 2013 at 8:12 am:

Ha Good to hear from you Sykora :) I agree with your points. I think JW was set up for success when the “upper echelons” were supportive, and we were all destined for failure once those winds changed and upper management suddenly cared very deeply about the team’s every move on a day-to-day basis. Something makes me think that even if they had a reasonable ratio of right/wrong in their technical decisions, we’d still be irritated to be completely out of control of our own destinies, especially given our history of self-organization. Once the team gets a taste of autonomy, it’s hard for them to go back to command-and-control management. I suspect you of all people would agree. ;)

I agree with you on diversity too. We had a pretty diverse team but I think it worked well because we were respectful and trusted each other. I’ve seen teams since that were much less diverse and would argue disrespectfully about tiny things compared to some of the things that we would (calmly) discuss and (quickly) resolve. I’ll tell you one thing that I know for certain now: egos don’t make good team-mates. ;)

I agree with the “creating something of value” point as well, but I’m less about the value to the customer and more about the engineering craftsmanship. It’s the only way I’ve found myself able to reconcile product decisions I can’t agree with. Maybe one day I’ll have my greasy hands in the product side as well and get a chance to care all the way through though. ;)


tibi said on May 11, 2013 at 9:11 am:

Well that was a nice post. And now I miss the team even more. I think you hit it right on the head (I can’t correctly remember the nail euphemism :–) )…

I’ve been looking for the same dynamics in a team ever since, and I think I’m getting close, but there is always something missing. Perhaps it’s the Tim Horton’s cup arch…


Jeff said on May 11, 2013 at 3:53 pm:

Some great thoughts here Gregg – on the money pretty much across the board. A team is definitely only as good as it’s leadership allows. I’ve been in situations where the core team was incredible and worked together well but ultimately failed for the various reasons you any Tom both mention.

p.s. Hiya boys – hope everyone’s doing well. Did someone say Tim’s?


Gregg Caines said on May 13, 2013 at 9:01 am:

Tibz and Jonesy! Good to see you two are still alive. :)


Chris said on May 14, 2013 at 7:06 am:

It was something special to watch and interact with.

You do not really know what you’ve got until it’s gone, eh? There was a sense of what was happening but probably also a sense that it would fairly easily be recreated.

My current co-workers are tired of hearing how “we would have did it at my last place.” ;) Somehow it is not fair to hold other teams to that yardstick.


JW said on June 10, 2013 at 8:56 am:

So many great things have been said above – and I have to agree with all of it. From my perspective, it really was a dream team. The trust, collaboration and commitment levels in the entire office were exceptional.

I have read numerous books on leadership and when I reflect on that team, we had all the characteristics of a high performing team.

For me, I loved the level of achievement, craftsmanship, and fun that we had. It wasn’t work – even a few of those late nights weren’t too bad because we were doing it together.

Is there a secret sauce to recreate that environment? Not that I have found. However, I have been lucky enough to be part of 4 different high performing teams over my career, and I believe there are things we can all do to help create an environment that could support a high performing team. From there – it is up to the team to choose to be high performing.

I identify well with the Patrick Lencioni’s book “The 5 Dysfunctions of Team”. Although I will not rewrite each of the characteristics from the book, I feel that we need to provide special mention to trust. It is the foundation of a high performing team. Nothing can be built without it. That means we need to know our team mates and their goals. How can we help each other? The more we help and recognize each other, the quicker the team will want to collaborate and achieve greater things.

We are all lucky to have been part of something great. I learned a ton from everyone there and I share those insights in my classes everyday.

Gregg – many thanks for starting this thread.