One of my favourite ways to tackle tech debt is to fix it as I work through product requirements. There are some great advantages to it:

  • It automatically prioritizes tech debt burndown in areas that you’re probably going to touch again soon, which is where it’s most valuable.
  • It doesn’t require blocking any product development flow. It just slows it down (probably imperceptibly).
  • It doesn’t even require conversations outside of engineering.

I’d hazard a guess that this is considered a best practice, so people making technical improvements this way are in good company. I call this “the organic method” because improvements happen naturally, or “organically”, as other changes occur.

There are some downsides though (especially with large codebases with many affected developers):

  • It hides the actual cost. Is it really better for product development to be imperceptibly slower? Wouldn’t it be nicer if costs were more explicit and obvious?
  • It’s a lot easier to do tech improvement and product development separately. Doing two things at once is almost always more complicated.
  • It’s easier to find patterns, better solutions and shortcuts for even fairly mechanical technical improvement work if you’re focusing only on that technical improvement work for some period of time.

Usually the biggest downside is that it’s slower.

In practice, I always find that it’s much, much slower than you’d think. Here’s a graph of a JavaScript to TypeScript conversion effort that I’ve been tracking for the past 11 months:

burning down technical debt

There are 2 small steep declines here that show the efforts of single individuals for short periods of time, but otherwise this graph (spanning almost a year) is a 64-file improvement out of 223 files in 11 months. At that rate, the effort will take 3.5 years.

I’ve tracked a number of similar efforts over the last year and the results are similar. My previous experience with organic improvement in large codebases feels pretty similar too: Without specific mechanisms to keep the conversion going, it naturally slows (or worse, stops).

Why does it matter if it’s slower?

Maintaining the incomplete state of the conversion is sneakily very expensive:

  • It’s harder for newcomers to learn the right patterns when multiple exist
  • Engineers need to remember all the ongoing efforts that are underway and always be vigilant in their work and in their code reviews of others’ work
  • Diffs are easier to understand when they don’t try to do too many things at once
  • Copy/pasta programming, forgetfulness, and uneducated newcomers lead to contagiousness: propagation of the wrong pattern instead of the right one
  • When you’re really slow, you’re even more likely to have multiple of these efforts underway at once, compounding the complexity of fixing them
  • If you’re slow enough and not actually doing cost-benefit analysis, patterns can be found that are “even better” than the ones underway. This is how you end up with even more ways to do the same thing and a smaller number of engineers that find joy in working in that codebase.

Most importantly though, if there’s really value in paying for that technical improvement, why not pay it sooner rather than later? Ironically, most of the least productive (and least fun) codebases I’ve seen got that way because people made numerous actual improvements but then left them only partially applied. Good intentions without successful follow-through can easily make the code worse.

For larger technical improvements (ones that affect too many files to pull off in a week or less) you want to make sure that:

  • You have a vague idea of the cost and you’re actually making an improvement that you think will be worth the cost.
  • The timing for doing it now is right (and there isn’t something higher value you could do instead)
  • You actually have a plan that converges on total conversion in a reasonable amount of time instead of something that just leaves the codebase in an inconsistent state for an extended period of time.
  • The goal, the timing and the plan are generally approved by your teammates (even if unanimity is impossible)

Once you’ve got those 4 factors in place, you’re probably better off in the long run if you capitalize on the improvement as quickly as possible. You probably don’t want to cease all product development for a huge amount of time, or send one developer hero off to fix it all, but you’ll probably want to come up with something better than organic improvement too, if you really care about that improvement.

In my experience, cross-functional teams align people to business goals best, and so they can get to real results much faster and much easier than teams made up of a single function. They really don’t seem to be that popular, so I thought I’d talk about them a bit.

Here’s some common chatter across mono-functional teams:

The Engineering team:

  • “We should never be taking on tech debt. Tech debt slows us down!”
  • “We should stop everything and clean up all our tech debt, regardless of cost or current business goals”
  • “We should convert all our code to use this new framework everywhere because it has [INSERT TODAY’S LATEST DEVELOPMENT FAD]”
  • “It’s the testers’ job to test, not mine”
  • “Works on my machine!”
  • “Let ops know that I used the latest version of the database client library, so they’ll have to upgrade all the databases”

The Testing team:

  • “Let me see if I can find a reason to stop this release”
  • “We need X more days to test before the release”

The Frontend/Mobile/iOS/Android team:

  • “That bug is on the backend.”

The Backend team:

  • “That bug is on the frontend.”

The Operations team:

  • “We haven’t got time to do the release today. Let’s schedule something for early next week.”
  • “Engineering doesn’t need access to that. Just tell us what you need.”

The Design team:

  • “We don’t want to help with a quick and dirty design for that feature experiment. It doesn’t fit into our vision”
  • “We’ve got the new total redesign for this month specced out.”

The Product Management team:

  • “That technical debt burndown can wait, right?”
  • “We should do this the fastest way possible.”
  • “Here are the detailed specs of what I want you to build. Don’t worry about what problem we’re trying to solve.”
  • “I’ve finally finished our detailed roadmap for the year.”

Do you see the patterns?

These teams…

  • optimize for the areas of their specialization, not for the business’ goals or for other teams’ goals.
  • defend their area of specialization by hoarding power and information
  • constantly try to expand their area of specialization at the expense of the business’ goals
  • focus more on looking busy than getting real business results
  • push blame to others
  • too willingly take on expensive projects where others pay the majority of the costs or where the value isn’t aligned with company goals.

So what to do instead?

Well, you get these mono-functional teams because someone talking about a specialty or discipline once said something like “X is important. We should have a team for X.”

My suggestion instead is simply to start saying “X is important. We should have X on every team.”

This leads to a team with a bunch of different but cooperating specialties. The only thing they all have in common is their team’s portion of the business’ goals.

Think of it this way:

  • If the members of a team don’t share their goals, can they really even be called a team?
  • Why would you give goals to a team without also empowering them with all the specialist skillsets and the ability to deliver on those goals?

In general I’ve found that only a cross-functional team can make the proper trade-offs on its own, react quickly to changes in the world, and execute with minimal communication overhead. Once it has all the specialties it needs to autonomously deliver on its goals, you’re set up for a whole new level of speed of execution.

I’m not saying that cross-functional teams solve all the issues above, but they make the conversations happen on-team where it’s much cheaper than across teams, and the conversations are much easier because people don’t have to guess each other’s motives nearly as much.

It’s not an easy transition either if you’re currently on mono-functional teams. In my experience though, cross-functional teams can really make mono-functional teams look like a morass of endless disagreements and easily avoidable meetings.

Too often when I see a team trying to replace a bad/old/deprecated pattern that is widespread in a codebase, they default to what I call The Hero Solution: One person on the team goes through and fixes every case of it themselves.

This can work for very small efforts, but it’s almost always a terrible solution in larger efforts for a few reasons:

  • When the bad pattern is widespread this is the slowest way to fix it and the slowest way to get to value from the new pattern.
  • There’s nothing in this policy that stops other people from continuing to add the bad pattern. Indeed there will often be code with the bad pattern that they want to copy/paste/modify, making the bad pattern almost contagious.
  • Teammates may be working in the same areas of the codebase causing merge conflicts that slow both the teammate and the hero down.

Here are a few tips that will see better results:

Track the bad pattern

Find a way to track instances of that bad pattern over time. Often a simple git grep "whatever" | wc -l will tell you how many cases of it you have in the codebase. Check often and record the values. Whatever your strategy is, if it’s not trending toward 0 in a reasonable timeframe, your strategy is not good enough. Come up with something else.

I can’t tell you how many cases I’ve seen of efforts trending toward multiple years (sometimes as much as 10 years) as soon as I started measuring over time, determining the rate of change, and extrapolating the completion date.

If you do nothing else, do this. You’ll quickly be able to see the cost (in time) and be able to reassess the value proposition.
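
To make that concrete, here’s a rough sketch (in Node.js) of what the tracking could look like. The grep pattern, the log file name, and the straight-line extrapolation are all assumptions for illustration; the point is just to record dated counts and project a completion date from the actual trend:

// track-bad-pattern.js (sketch): append today's count and print a naive forecast.
// Run it on whatever schedule you like (cron, CI, a weekly reminder).
const { execSync } = require("child_process");
const fs = require("fs");

const LOG_FILE = "bad-pattern-counts.json"; // hypothetical location for the history

// Same idea as `git grep "whatever" | wc -l`. The `|| true` guard is there because
// git grep exits non-zero when there are no matches left (the day you're hoping for).
const output = execSync('git grep "whatever" || true', { encoding: "utf8" });
const count = output.split("\n").filter(Boolean).length;

const log = fs.existsSync(LOG_FILE) ? JSON.parse(fs.readFileSync(LOG_FILE, "utf8")) : [];
log.push({ date: new Date().toISOString(), count });
fs.writeFileSync(LOG_FILE, JSON.stringify(log, null, 2));

if (log.length >= 2) {
  const first = log[0];
  const last = log[log.length - 1];
  const days = (new Date(last.date) - new Date(first.date)) / 86400000;
  const removedPerDay = days > 0 ? (first.count - last.count) / days : 0;
  if (removedPerDay > 0) {
    console.log(`~${Math.ceil(last.count / removedPerDay)} days to zero at the current rate`);
  } else {
    console.log("Not trending toward zero. Time for a better strategy.");
  }
}

Even just eyeballing that history after a few weeks makes the rate of change (or the lack of one) hard to ignore.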

Stop the spreading!

Agree with the team on a policy that no new instances of the bad pattern will be added without a team-wide discussion. Add pre-commit hooks that look for new cases (git grep is awesome again) and reject the commit. Look for new cases in Pull Requests. Get creative! Without this, you’re basically bailing a leaking boat without patching the leak and you will have:

  • People that want to copy/paste/modify some existing code that has the bad pattern
  • People that don’t even know the pattern is now considered bad
  • People that knowingly implement the bad pattern because they don’t know about the good pattern.

If you can’t get your teammates on board, your effort is doomed. I’ve seen bad patterns actually trend upwards toward infinity when no efforts have been taken to get consensus or stop the bad pattern.

NB: Because of the regular human effort involved in maintaining consensus and educating and re-educating people about the goals, one of the best ways to stop the spreading is to concentrate on faster total conversion to the new pattern. Having bad examples all over your codebase works against you on a daily basis. Bad patterns are contagious.
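
To make the “no new instances” policy concrete, here’s a minimal pre-commit hook sketch in Node.js. The regex is a made-up stand-in for your own bad pattern, and you’d have to wire the script up yourself (e.g. as .git/hooks/pre-commit or via your hook manager of choice). It only inspects lines being added, so existing instances don’t block anyone:

#!/usr/bin/env node
// pre-commit (sketch): reject commits whose staged changes introduce the bad pattern.
const { execSync } = require("child_process");

const BAD_PATTERN = /oldDeprecatedThing\(/; // hypothetical pattern being phased out

// Look only at the added lines in the staged diff.
const staged = execSync("git diff --cached --unified=0", { encoding: "utf8" });
const offending = staged
  .split("\n")
  .filter((line) => line.startsWith("+") && !line.startsWith("+++"))
  .filter((line) => BAD_PATTERN.test(line));

if (offending.length > 0) {
  console.error("Commit rejected: these added lines use the bad pattern:");
  offending.forEach((line) => console.error("  " + line.slice(1).trim()));
  console.error("Use the agreed-upon replacement, or raise it with the team.");
  process.exit(1);
}

The same check works fine as a CI step on pull requests if local hooks are too easy to skip in your setup.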

Get your teammates involved in the conversion!

Here are a few ideas:

  • Have a rule where no modified files (or functions or modules or whatever doesn’t feel too aggressive for your team) can contain the bad pattern anymore. Figure out ways to enforce this automatically if possible (there’s a sketch of one option after this list), or in code reviews if not.
  • Break the work into chunks and schedule those pieces within other product work on a regular basis. This is sort of a nuclear option, but if the chunks are small enough (yet still encompass the entire scope of the conversion), you can show regular and reliable progress without stopping production work for any extended period of time.
  • Get other people to help with the conversion! If people are bought into it, there’s no reason one person should be doing it alone. Multiple people working on it (in a coordinated fashion) will reduce merge conflicts with product work, and increase knowledge sharing about the proper pattern. You may even get better ideas about how to convert.
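
For the first idea on that list, here’s a rough sketch of an automated check (Node.js again; the base branch name and the pattern string are placeholders) that fails when any file touched on the current branch still contains the bad pattern:

// check-touched-files.js (sketch): "files you modify must be free of the bad pattern".
const { execSync } = require("child_process");

const BASE = "origin/main";             // hypothetical base branch
const PATTERN = "oldDeprecatedThing(";  // fixed-string version of the bad pattern

const changedFiles = execSync(`git diff --name-only ${BASE}...HEAD`, { encoding: "utf8" })
  .split("\n")
  .filter(Boolean);

const dirty = changedFiles.filter((file) => {
  try {
    // git grep exits non-zero when there are no matches (or the file no longer exists).
    execSync(`git grep -q -F "${PATTERN}" -- "${file}"`);
    return true;
  } catch (e) {
    return false;
  }
});

if (dirty.length > 0) {
  console.error("These modified files still contain the bad pattern:\n  " + dirty.join("\n  "));
  process.exit(1);
}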

Don’t do things that don’t work.

Stuff that doesn’t work:

  • Efforts that the team as a whole doesn’t find valuable / worth the cost.
  • Efforts that are ill-timed. Should you really do it now? Is this really the most important thing?
  • Efforts that are not tracking toward 0 in a reasonable amount of time. Partial conversions are really hard to manage. They may not be a strictly technical concern, but they are a concern for on-boarding, managing, complexity/mental-overhead, knowledge-sharing, etc. Come up with a strategy that doesn’t prolong them.

Big problems need smarter solutions!

I always try to think about what exactly it is about experience that makes a software developer better. Are there aspects of experience that are teachable, but we don’t understand them yet well enough to teach them? Can we do better to pass on “experience” rather than have every developer suffer through the same mistakes that we suffered through?

I know I’ve had a couple of hard-won lessons over the years that really helped me be more successful in software engineering. I’ve been able to see the warning signs of mistakes to avoid for a long time, but only recently have I figured out the reasons behind those warning signs. And I think some of the reasons can be explained mostly in terms of probability math, which can then be taught, right?

Before I go into this, I should preface it by saying I’ve failed many, many math classes. I hardly ever use any advanced math in my work, and I doubt many other programmers do. I wish I had more patience and had learned more math (especially how to apply it), too. So with that said, here goes…

Lesson: The Certainty of Possibility at Scale or over Time

The first piece of experience I’d like to pass on is that your “flawless” solution to a problem, at scale, over time, will fail.

I don’t know how many times I’ve heard an engineer say something like “This solution is basically bulletproof. There’s only a 1 in a million chance that that corner case you just mentioned will occur”, and then promptly put the solution into an environment that does 1 billion transactions a month.

Here’s how the math looks:

1B transactions/month * 1/1M probability of failure per transaction
= 1B / 1M failures per month
= 1000 failures per month.

Yes, the mathematics of probability are telling us that that particular solution will fail roughly 1000 times a month. This is probably a “duh” conclusion for anyone with a computer science degree, but I see developers (yes even ones with computer science degrees) failing to apply this to their work all the time.

At scale and over time, pretty much everything converges on failure. My favourite thing to tell people is that “At this scale, our users could find themselves in an if (false) { statement.”.

So what does this mean? Well it doesn’t mean everything you do has to be perfect. The downsides of failure could be extremely small or completely acceptable (In fact actually pursuing perfection can get really expensive really quickly). What this tells you though is how often to expect failure. For any successful piece of software, you should expect it often. Failure is not an exceptional case. You have to build observable systems to make sure you know about the failures. You have to build systems that are resilient to failure.

Often I hear people talking about solution designs that have high downsides in the case of failure with no back-up plan and defending them with “But what could go wrong?”. This is a tough one for an experienced developer to answer, because having experience doesn’t mean that you can see the future. In this case all the experienced developer knows is that something will go wrong. Indeed when I’m in these conversations, I can sometimes even find one or two things that can go wrong that the other developer hadn’t considered and their reaction is usually something like “Yeah, that’s true, but now that you mentioned those cases I can solve for them. I guess we’re bulletproof now, right?”.

The tough thing about unknown unknowns is that they’re unknown. You’re never bulletproof. Design better solutions by expecting failure.

Understanding this lesson is where ideas like chaos monkey, crash-only architecture, and blameless post mortems come from. You can learn it from the math above, or you can learn it the hard way like I did.

Lesson: Failure Probability Compounds

Here’s the second piece of mathematically-based wisdom that I also learned the hard way instead: if you have a system with multiple parts relying on one another (basically the definition of a system, and of any computer program ever written), then the success rate of the system is the product of the success rates of the individual components. In other words, the failure rates compound.

Here’s an almost believable example: let’s pretend that you’ve just released a mobile app and you’re seeing a 0.5% crash rate (let’s be wildly unrealistic for simplicity and pretend that all the bugs manifest as crashes). That means you’re 99.5% problem-free, right? Well, what if I told you the backend also has a success rate of only 99.5%, and it’s either not reporting its errors properly or your mobile app is not checking for those error scenarios properly?

Probability says that you compute the overall success rate by multiplying the two success rates, i.e.:

99.5% X 99.5% = 99%

What if you’re running that backend on an ISP that’s got a 99.5% uptime? And your load balancer has a 99.5% success rate? And the user’s wifi has 99.5% uptime? And your database has a 99.5% success rate?

99.5% X 99.5% X 99.5% X 99.5% X 99.5% X 99.5% = 97%

Now you’ve got a 3% error rate! 3 out of every 100 requests fails now. If one of your users makes 50 requests in their session, there’s a good chance that at least one of them will fail. You had all these 99’s and still an unhappy user because you’ve got multiple components and their combined rate of error is the product of each component’s error rate.
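
If you want to play with the numbers yourself, here’s a tiny sketch of that multiplication (the component rates are just the made-up 99.5% figures from above):

// The overall success rate of a chain of dependent components is the product of their success rates.
function systemSuccessRate(componentRates) {
  return componentRates.reduce((total, rate) => total * rate, 1);
}

// Six components at 99.5% each, as in the example above:
const overall = systemSuccessRate([0.995, 0.995, 0.995, 0.995, 0.995, 0.995]);
console.log((overall * 100).toFixed(1) + "% of requests succeed"); // ~97.0%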

This is why it’s such a big deal when ISPs and platforms-as-a-service talk about uptime with “5 nines”, or 99.999%. They know they’re just one part of your entire system, and every additional component you have has a failure rate that compounds with their baseline failure rate. Your user doesn’t care about the success rate of the components of the system — the user cares about the success rate of the system as a whole.

If there’s anything at all that you should take from this, it’s that the more parts your system has, the harder it is to keep a high rate of success. My experience bears this out too; I don’t know how many times I’ve simplified a system (by removing superfluous components) only to see the error rate reduce for free with no specific additional effort.

Lesson: When faced with high uncertainty and/or little data, the past is the best predictor of the future.

We saw in the last lesson one example of how surprising complexity can be. The inexperienced developer is generally unaware of that and will naively try to predict the future based on their best understanding of a system, but the stuff we work on is often complex and so frequently defies prediction. Often it’s just better to use past data to predict the future instead of trying to reason about it.

Let’s say you’re working on a really tough bug that only reproduces in production. You’ve made 3 attempts (and deployments) to fix it, all of which you thought would work, but either didn’t or just resulted in a new issue. You could think on your next attempt that you’ve absolutely got it this time and give yourself a 100% chance of success like you did the other 3 times. Or you could call into question your command of the problem entirely and assume you’ve got a 50/50 chance, because there are two possible outcomes; success or heartache. I personally treat high complexity situations like low data situations though. If we treat the situation like we don’t have enough information to answer the problem (and we’ve got 3 failed attempts proving that that’s the case), we can use Bayesian inference to get a more realistic probability. Bayesian inference, and more specifically, Laplace’s Law tells us that we should consider the past. Laplace’s Law would say that since we’ve had 0 successes so far in 3 attempts, the probability for the next deployment to be a success is:

  = (successes + 1) / (attempts + 2)
  = (0 + 1) / (3 + 2)
  = 1 / 5
  = 20 %
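
Here’s the same rule as a tiny helper, in case you want to plug in your own track record (a sketch of Laplace’s rule of succession, not a full Bayesian treatment):

// Laplace's rule of succession: estimated probability that the next attempt succeeds.
function laplaceEstimate(successes, attempts) {
  return (successes + 1) / (attempts + 2);
}

console.log(laplaceEstimate(0, 3)); // 0.2 -- the 20% chance from the example above
console.log(laplaceEstimate(2, 3)); // 0.6 -- a better track record earns a better estimate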

This is the statistical approach I use to predict too. With no additional information, if I’ve failed at something a number of times, the chances of me succeeding in the future are reduced. I don’t use this data in a depressingly fatalistic way though — I use it to tell myself when it’s time to make more drastic changes in my approach. Something with a 20% chance of success probably needs a much different, more powerful approach. I also stop telling people “I’m sure I’ve definitely got it this time”.

Similarly, if there’s an area of the codebase that has received a comparatively large number of bug-fixes, I’m going to lean towards it being more likely to have bugs than the areas with fewer bugfixes. This may now seem obvious, but if you’re the person that did all those bugfixes, you may be biased towards believing that after all those fixes, it must be less likely to still be buggy. I’d happily bet even money against that, and I’ve actually won a lot of money that way.

I think it’s fair to say that this is extremely counter-intuitive and may on its face look like a form of the Gambler’s fallacy, but remember it’s for use in low-information scenarios, and I would say that numerous defects are clear evidence you didn’t have enough information about the system to predict this bug in the first place.

Relatedly, Extreme Programming has a simplified statistical method of estimating how much work a team can handle in a sprint called “Yesterday’s Weather”. Instead of having a huge complicated formula for how much work a team can handle in a given timeframe, they simply look at what the team was able to accomplish last sprint. It’s going to be wrong a lot of course, but so is whatever huge complicated formula you devise.

If there’s a more generalized lesson that you should take from this, it’s that we work with complex systems, they’re notoriously difficult for humans to predict, and predicting statistically can often get you a clearer picture of the reality of a situation. Resist prediction from first principles or through reasoning. You don’t have enough data, and software systems are normally elusively complex.

With all this said, I know it’s pretty hard to really get a feel for the impact of probabilities until you’ve had the prerequisite defects, failed deployments, and product outages. I sure wish someone had at least taken a shot on telling me in advance though. It also makes me wonder what I’ve still got left to learn.

The 4th criterion of the Joel Test of quality on a software team is:

  4. Do you have a bug database?

You probably do. The test was written almost 20 years ago and if I recall correctly almost everyone had one anyway. Joel was even selling one.

On the surface it seems like common sense. Quality is something you want to manage. So naturally you’ll also tend to want to log things… measure things… track things.

I want to propose a better way for most situations though. It’s simpler, your long-term speed will be faster, and your product quality will be higher:

The Zero Defect Policy

Simply put: Prioritize every defect above feature work or close the issue as a WONTFIX. I’m not suggesting you interrupt any work already in progress, but once that’s done, burn your defect list down to 0.

Why in the world?

  • Bugs are much cheaper to fix immediately. You know the product requirements most clearly at that time. You were touching that code most recently at that time and it’s still fresh in your mind. You might even remember a related commit that could have been the cause. The person who wrote it is still on the team.

  • Most things we’d classify as bugs are really some of the most obvious improvements you could make. You probably don’t need to A/B test a typo fix. If your Android app is crashing on the latest Samsung phone, you don’t need a focus group.

  • Managing bugs is a huge effort. If you’re not going to immediately fix them, you have to tag them, categorize them, prioritize them, deduplicate them, work around them, do preliminary investigations on them, revisit them, and have meetings about them.

  • The development team needs the proper immediate feedback and back-pressure to know when to speed up and when to slow down.

Say What now?

Defects are perfectly normal. If you don’t have them, you’re either NASA or you’re developing too slowly. However, if you have too many, or if they’re particularly costly, you absolutely need to slow down and turn them into learning opportunities. In those times, the fix isn’t enough. The fix with an automated test isn’t even enough. You’ll want to look into:

  • other prevention measures
  • faster detection measures
  • faster ways to fix
  • better ways to limit impact

This is the REAL definition of Quality Assurance. If you do this thoughtfully, and don’t try to make an army of manual testers the best solution you could come up with, over the long term you’ll be much much faster. You’ll be the kind of product development team that the company actually believes when you say something is “done”.

What about the low value, high cost bugs?

Delete them. If they become worthwhile later, you’ll hear about them again. If you can’t bring yourself to delete it, you probably value it too much to not fix it. Just fix it.

What about when my team ends up with a week’s worth of defects and can’t get any feature work through?

There will definitely be dark times. Slow down and learn from them. It’s the best time to talk about speed improvements, because speed and quality are interdependent. You can move much faster in mistake-proof environments. In most cases, nobody cares how fast you ship broken software.

Sounds great, but what about our existing bug database of hundreds of bugs?

Many are probably not even bugs anymore if you’ve been collecting long enough to get hundreds. How many of those old things can you even reproduce? How many have clear enough descriptions that you can even still understand the problem? Is the original submitter still around? Here’s my solution: delete all but the highest priority ones and immediately schedule the high priority ones above other features. Allow people to file reports again if they really care about a particular defect and the defect still exists.

The bug database is great in theory, but in practice, it’s often an aging garbage heap of ignored customer frustrations and a /dev/null for opportunities for improvement.

I’ve had a chance to get back to coding after a couple of months’ hiatus so I thought I’d write about something a bit more fun and way down in the details.

Conditionals add complexity — they seem like a tiny bit of complexity in the singular case, but they really do add up quickly to make code unpredictable and difficult to follow.

Your debugger and your unit tests are probably great at managing that complexity, but they’re really kind of a crutch. You want to remove complexity, not manage it.

As a result I have a bunch of rules around conditionals that I generally always try to follow.

Early-Return is Simpler

Probably my most important personal rule is to use early-returns wherever possible. In general, it’s not simpler to just have a single return at the bottom of a function when you can return earlier in some circumstances. If you know a value shouldn’t be changed after a certain point in a function you should just return right there. There are two reasons it’s simpler:

  • You don’t have to try to reason about what might still happen later
  • Subsequent reading of the code (by you or others) can be much faster because you can stop reading the code for a function as soon as the case you’re investigating hits a return.

function fizzMyBuzz(i) {
  var output;
  if (i % 15 == 0) {
    output = "FizzBuzz";
  } else if (i % 3 == 0) {
    output = "Fizz";
  } else if (i % 5 == 0) {
    output = "Buzz";
  } else {
    output = i;
  }
  return output;
}

VS

function fizzMyBuzz(i) {
  if (i % 15 == 0) {
    return "FizzBuzz";
  }
  if (i % 3 == 0) {
    return "Fizz";
  }
  if (i % 5 == 0) {
    return "Buzz";
  }
  return i;
}

There’s a visible difference in density here. The second example just has less stuff, even though the number of lines is pretty similar. Early returns mean that subsequent code has to worry a lot less about the effects of previous code.

There’s definitely a trade-off here. Now when you’re looking for where the function returns, it’s not just at the end — it could be in a bunch of places. I think the trade-off makes sense though because even when I’m tracing through function calls in a backward-fashion, I’m usually reading functions from top to bottom.

Try to NOT use negation

Negation is complexity. In my experience it probably even beats off-by-one errors in the category of “things that are too simple to possibly go wrong that go wrong all the time”.

Example:

Don’t allow additional vegetables in this salad but avoid forbidding the addition of non-orange vegetables

So can this salad have carrots or not?!?

VS

No orange vegetables

Oh

Something like…

if (!specialCase){
  someStuff();
} else {
  someOtherStuff();
}

…should at the very least be changed to:

if (specialCase){
  someOtherStuff();
} else {
  someStuff();
}

One exception to this rule: If negation will let you early-return, definitely do that! Hopefully this non-non-negation exception is not too complex. ;)

Now you can pretty much deprecate else.

Often when I can’t return early, I just move the entire if..else block to a method where I can return early. Then I don’t need else.

Trying to restrict my use of else is a great forcing function for creating smaller functions and using early-return more often.

for (var i=1; i <= 20; i++){
  if (i % 15 == 0) {
    console.log("FizzBuzz");
  } else if (i % 3 == 0) {
    console.log("Fizz");
  } else if (i % 5 == 0) {
    console.log("Buzz");
  } else {
    console.log(i);
  }
}

VS

for (var i=1; i <= 20; i++){
  console.log(fizzMyBuzz(i));  // We wrote this earlier in the post!
}

I’ve been developing software for a few years without doing much estimation at all, and the estimation I’ve been doing has been really vague, eg “You know that feature is a lot of work, right?”. I’ve recently been reading more and more from the #noestimates movement on The Twitter, so I thought I’d chime in a bit with my rationale and experience as well.

“Is the company crazy?”

Yes, but I don’t think the lack of estimation contributes to that. Estimation has a bunch of downsides to it that make eschewing it pretty rational:

  • Teaching people estimation best practices takes a lot of time. Without spending that time, the estimates are terrible. Story points, planning poker, and other methods seem to blow people’s minds.
  • Estimation meetings take a lot of time.
  • It’s almost impossible to ensure that people (engineers or management) are not treating the estimates as deadlines and unnecessarily rushing work (creating unnecessary defects and technical debt that slow the team down more).
  • A lot of estimates are still really terrible and therefore low-value and not worth the effort. Decades of industry-wide software development thinking have not changed that.
  • A lot of estimates don’t actually matter. A 1-week task that is off by 50% but still successful with users, is almost always still a huge success. Sure being off by 50% on a 6 month plan is really bad, but my best recommendation for 6-month plans is “don’t”.

“But then how do people know when to stop working?”

You’ve probably heard Parkinson’s Law that “work expands to fill the time available”. In my opinion it’s a super-cynical way of thinking about people but I’ve known many people that believe it. If it seems true to you in your organization consider these factors that might be contributing:

  • Engineers almost always under-estimate (unless they’re consciously trying to under-promise/over-deliver — cynically called “sand-bagging”). This is because they’re estimating best-case scenarios, and life with complex socio-technical systems rarely falls in the best-case.
  • The focus on how long things take often incentivizes engineers to take short-cuts and to just allow technical debt and complexity to pile up. If they’ve got extra time in an estimate for some reason (hard to believe because of the previous point), they actually start doing the refactoring that they would otherwise not consider doing.
  • They may be gold-plating something because they don’t know what the next most important thing to do is, they don’t understand or believe in the value of it, or they don’t think they’ll get a chance to ever return to the current feature to tweak/improve it. In this case, there are probably trust issues to solve.

Instead…

  • Make sure people are adequately incentivized to move on to the next thing. Probably by making them care about the next thing.
  • Have a clear definition of what it means to be done (ala Scrum’s Definition of Done).
  • Give them all the time in the world for the current thing. Let them relax and do it right so it stays done instead of being a constant source of future interruption as the defects roll in.

If you have trust and the right people, the team will move really fast, especially over the long haul.

“Estimates can be a good forcing function though. Otherwise how do people know when to cut functionality?”

Just cut it all immediately. Decide the absolute essentials of the feature, and cut the rest immediately with no mercy. Deliver those essentials first. Then incrementally try to fit in the next most valuable aspects.

This is the only sane way to ensure you’re doing the highest value work first. You don’t need scheduling — you just need prioritization. The stuff that gets cut will be the low-priority stuff.

“I need some level of predictability in the software development process though!”

That doesn’t mean you’re going to get it. If you’ve been in software development for any length of time, you know that estimates are often wrong and you haven’t figured out how to make them better. In complex systems, prediction rarely leads to predictability.

Instead of trying to predict, you should be aiming to mitigate risk as early and often as possible, by doing the riskiest, least clear, and highest-value efforts first in the leanest way possible.

This allows you to incrementally move to the more proven, clear, and next-highest-value efforts over time, and have an effort that observably converges on completion. It’s a better shot at predictability, but without prediction.

“But everyone else does it!”

This is the worst of the arguments, in my opinion. I love employing best practices when I don’t have a better way, but otherwise they’re the enemy of continuous improvement. Estimates on their own deliver no user-value. Any effort that isn’t yielding comparable results should be axed. Any goal that can be better served by other methods should just be solved by those methods.

If you’re not a developer and you still don’t buy my argument…

Let’s flip the script.

As an engineer I’ve yet to meet anyone in management that is willing to give me estimates on the expected value of a feature (preferably in dollars, but I’ll take whatever the proxy/vanity metric of the day is too!). This would be super-valuable to ensure we’re prioritizing the most impactful stuff first, right? And we could check the validity of these estimates after the feature is released, right?

I think this would be a hilarious way to turn the tables and see how management does with estimation of complex systems, but in the end I think it would be similarly fruitless for improving predictability. Their estimates would be just as wrong, just as often. They’d be just as nervous about being confronted on their accuracy too.

Estimation is a linear process management tool in a non-linear world.

Estimation just doesn’t really provide us with much predictability considering its cost.

That doesn’t mean that no one should ever do it — it will definitely make sense in some cases — but I personally think it’s overused, largely ineffective, and often destructive in most of the cases that it’s used.

I’ve recently seen a case of spontaneous waterfall-style development emerging where the team was generally empowered to avoid it or at least to correct against it, and certainly wanted to, and yet did not. From the outside it seemed like waterfall was a kind of steady-state for them that they’d trended toward and didn’t have the escape-velocity to move toward anything else. They weren’t happy with it either. Most of them felt quite alienated from their work which is a common result of waterfall development. What could have happened?

I think there are a bunch of properties that can lead to waterfall-style development and management is only one of them. Before I get into those, I’d first like to propose a simplified explanation of what it means to be waterfall.

What’s This Waterfall Thing That They’re Talking About?

Waterfall development is development with varying degrees of one-way hand-offs between multiple disciplines. The best way to think of it is to imagine a sort of human assembly line where each homogenous set of disciplinarians pass the result of their discipline down to the next homogenous set of disciplinarians. More concretely, you’ve got these hand-offs when the product managers tell the designers what to design and the designers tell the engineers what to build and the engineers give the testers the work to test and the testers give the release manager a release to release and the release manager gives the sysadmins a deployable to deploy. It’s all very well organized and simple to explain and generally soul-sucking and totally inefficient in practice.

It’s soul-sucking for a few reasons:

  • The downstream disciplines have very little power to make drastic changes, because of the tyranny of sunk-cost fallacy. For example, if the testers find some behaviour that is terrible for the users, the engineers will shrug and say “not a bug… works as designed!”. Or if the engineers determine there’s a performance problem with the fully completed hi-fi design mocks that make them infeasible, the designer is at best forced to go back to the drawing board to take another shot at getting an idea that the engineers might agree on.
  • The downstream disciplines have no idea why they’re designing/building/testing/releasing what they’re designing/building/testing/releasing. Waterfall is set up as an assembly line in the name of efficiency and that means that downstream disciplines are on a need-to-know basis. The problem with this is that you suck the motivation out of everyone because knowing “why” is precisely where motivation comes from.
  • The upstream disciplines are generally resented by the downstream disciplines for how they’re always ignoring the important downstream concerns.
  • The upstream disciplines are constantly pressured to do more and more flawless work “to minimize all that expensive back-and-forth”.

It’s funny when people start to talk about “all that expensive back-and-forth” because trying to avoid that is precisely what makes it so expensive. The upstream disciplines are always expected to trend toward perfection in pursuit of the ideal one-way waterfall. When you hear the engineers asking for perfect hi-fi mocks before they start their work, you’ll know you’re approaching waterfall-land. Your product development flow is about to go into slow-motion.

The Role of Single-Discipline Teams

There’s an idea too that similar disciplinarians should be the ones working and managed together as a unit; that’s how you achieve consistency in the discipline and how you ensure these specialists are always pushing each other to greater heights in their respective crafts. Unfortunately these specialists never really get a full view of the entire process. With neither broad understanding nor a holistic ownership mindset, the end result is that you’re paying a whole lot of smart people while simultaneously trying to avoid them thinking about the business’ problems. A specialist that doesn’t fully understand the business goals or how software development works in general will always be a pretty weak specialist regardless of the depth of their specialization.

This is how you get dysfunctional scenarios like:

  • QA personnel that feel like their primary goal is to try to find a way to block the release.
  • Designers that feel like customers should wait for the perfect aesthetics before we can deliver new functionality.
  • Engineers that can’t make good decisions about when to take on technical debt and when to pay it down.

Having individuals confined solely to their area of expertise can make sense in the short term! For example, I’m an engineer and I’ve regularly had terrible product ideas. In the short term there’s no sense in involving me further “upstream” in the process; I’ll only drag it down.

In the long term though, my terrible product sense is certainly a liability. And on a daily basis when I stand in the shower or on the subway thinking about work, I’m just coming up with more terrible ideas. The better thing to do is to teach me about the business and its goals and make that investment in me. You’ll occasionally get a worthwhile idea from me, and I’ll always be more willing to go that extra mile because I know why I’m building what I’m building and that it’s worthwhile work.

The Role of Product Management

Product management can play a significant role in whether or not that teaching and that context-sharing happens. I’ve seen teams prioritize their own product backlog for multiple weeks of work without a product manager present — that’s a team that the product manager has really invested in, and the result is that the product manager is comfortable giving them agency. There are some unexpected things that can happen here: the team will often have better ideas or prioritizations than the PM, or at the very least will have top-of-stream improvements on the PM’s ideas that would never happen otherwise. When this is really working, the one brain of the PM will almost certainly not be able to beat the multiple brains on the team. And when the PM considers themselves a contributing member of the team, the team will necessarily outperform the lone PM.

There’s a simple but slow progression that a skillful PM can push the team through if they’re in waterfall-mode due to the product management stage:

  1. Allow the team to own the solutioning. Take the team from working on predetermined solutions and start to instead involve them in determining the solutions themselves. Achieving this really takes a considerable amount of effort, but it usually pays for itself in spades later with the PM’s freed-up time, better work prioritizations, better product decisions, and less process churn (“expensive back-and-forth”). It’s amazingly motivating for the team too.
  2. Allow the team to own the problem-selection. Once you’re comfortable with their performance in determining the solutions, start to get their input on what the problems to solve are, coach them to seek problems that meet the team’s goals, and show them how to independently collaborate with other sources of company/product/user information. Give them access to the metrics they’re trying to improve. In time, be open to letting them try some ideas that you might not think are great; at the very least it’ll be a learning experience.
  3. Allow the team to have input into the mission. Everything is easier when the team thinks it’s working towards the right goals. The easiest way to get the team working on the right goals is to ensure that they have that context. If you brought the team through steps 1 and 2 above, you can see that they’re much more bought-in to plans that they’re part of. Involve them in actually determining the mission too! The end result will be a team that is incredibly fast-moving and energized.

Not to be too deep, but even though this is framed here as a maturity progression for the team, it can also be a maturity progression for the PM. The PM needs to be able to become the team’s guide to the market, the user’s needs, the company’s goals, etc, etc, instead of playing the part of puppeteer. And we’ve seen that it absolutely doesn’t put the PM out of a job. There’s definitely more than enough work in ensuring that you can be an expert guide and a strong teammate. The product manager on my current team was brave enough to engage with the rest of the team like this and the results are really amazing. Instead of pushing product management on the team, he pulls the team into product management.

Also, it should be obvious that when the team is spending time on things that are not software production, they will almost certainly release fewer lines of code. I’ve found instead though that the products developed this way are generally always delivering more value to the user, and doing so at a faster pace. No user is asking for “more software”.

Product Management Is Just One Possible Contributing Factor Though

In fact, any of the disciplines on the team has the ability to cause the emergence of a waterfall model of development. It’s a systemic problem that is really tough to solve because it emerges from the socio-technical complex system of the software, the team, the company, and the market. As a result, there’s almost never just a single cause of it.

So with the obvious influence of product management aside, these additional properties seem to make waterfall vastly more likely:

  • Inability to see how traditionally downstream disciplines can actually occur simultaneously or even upstream.
  • Perfectionism (release-o-phobia) due to the perception that a particular stage is very expensive or irreversible.
  • An “Only specialists can do X” attitude (often due to perfectionism), sometimes worsened by an approval process requiring sign-off.
  • Lack of will to collaborate across disciplines. There’s an inertia to collaboration that sometimes is hard to overcome.
  • Cargo cult adherence (eg “Such-and-such company does it this way”, “This is the way I’ve always done it”, “My software engineering textbook from 1968 says this is how you do it”)

I’ve personally never seen waterfall-style development occur in the absence of these factors so I think about ways to stamp them out whenever I see them.

For the team that just doesn’t know how to defeat waterfall but wants to, there are a tonne of things to try:

  • Create an environment of collaboration. Usually this involves collocated cross-discipline teams. When other companies are saying “Design is crucial. We should have a design team.”, try to instead say “Design is crucial. We should have a designer on every team.” It’s this attitude that will have your team able to deliver software entirely on its own without endless meetings with other teams.
  • Make sure the team is surrounded by the information that it needs to be to make the right decisions. They should have easy access to your user-researchers, data scientists, customer-support, subject-matter experts, etc, etc to easily learn everything that they have to about their problem space. If you can put those specialists on the team as well, that’s even better.
  • Try to get the team to focus on a very small number of things at once, ideally 1. You want to focus on finishing instead of starting and getting each member of the team to do something different is the opposite of that.

Once you’ve achieved that proper environment, experiment with ways to disassemble the traditional waterfall order of things:

  • Have many people involved in all manner of planning (if they’re interested… Some teammates may prefer to trust the others in their decisions and avoid all those meetings).
  • Have a common set of standard UI components so engineers can assemble pretty good designs quickly on their own.
  • Practice devops! It’s the ultimate collaboration of developers, release managers and sysadmins often with all 3 roles rolled into one person (and heavily automated!) but at least on the same team.
  • Decouple release from deployment with feature flags and actually deploy before design is even complete. Often you can deploy all kinds of non-production-ready stuff constantly and only turn it on when it’s ready. That’s the essence of Continuous Delivery.
  • Do testing BEFORE engineering. TDD and BDD are possibilities. If you have QA people, let them advise on what kinds of tests would be useful. That’s what Quality Assurance actually means.
  • If possible, have everyone own testing. If possible, have engineers own test automation and test their own/each other’s stuff.
  • Have lo-fi designs that are good enough to start engineering with. Let the hi-fi designs come later… maybe much later. Depending on your situation, you might be able to launch to 1% of your users with lo-fi designs and validate the product before investing in higher fidelity designs.
  • Have engineers and testers think through the lo-fi versions of the designs with the designer. Treat the designer as the team’s expert guide rather than putting all the design responsibility on them.
  • Break up your work into smaller and smaller pieces. Even if you’re doing some hand-offs, rework on small pieces is much cheaper.

The greater goal is to value results over effort, and to build the right software rather than the most software. Getting as many efforts in progress at once is an anti-pattern. Making sure everyone has something to do is a non-goal.

Without that perspective, many of these practices will seem inefficient. Sure sometimes you’ve got multiple people working slightly outside of their specialty, but the resulting software is generally always better, and all that expensive back-and-forth and rework is virtually eliminated. People are generally happier and more engaged as well because they know how important the work is and why.

Special thanks to Taylor Rogalski for feedback!

“Engineer” really is a very silly term for people that make software. Engineering is supposed to be a predictable application of science toward predictable results. Software development is really anything but that. The biggest problem in our profession is and always has been getting predictable results.

The key difference between a junior engineer and those that are more senior is that the more senior engineer realizes the inherent unpredictability of delivering working software and constantly works toward making that delivery more predictable.

The junior, on the other hand, approaches the work with two key hindrances:

  • Being too intellectually challenged (or enamoured) with the details of programming in the small to be able to see the larger picture
  • Having little experience with complex systems (and often not even understanding that they’re working in one)

As a result the more junior engineer will generally pursue solutions that risk predictability, frequently without realizing it. What’s worse is that often these solutions will be faster/better/cheaper, further reinforcing the approach. Failures with risky approaches are often written off as fluke mistakes, technicalities, or the nebulous “human error” without reconsidering the methodology. Often they’re written off like this even when the cost of the failures vastly outweighs the riskiness of the approach.

In software development, the details can sometimes be really difficult. This never really goes away with experience, but the junior engineer hasn’t yet been faced with that enough times to know it. And when the details are difficult, the junior engineer focuses primarily on the details, often to the detriment of the big picture — predictability. It’s a pretty understandable choice; our brains are limited in how much they can consider at once, and for some reason engineers have a tendency to default to concentrating on the technical details.

obligatory relevant xkcd

Conversely, the more senior engineer will be less encumbered by the technical details and will have been bitten a number of times by the nature of complex systems. If that engineer has taken the opportunity to learn from those mishaps, there’s a chance for a much higher level of engineering (and predictability).

With that in mind, here are a bunch of examples that I think show the differences more plainly:

Junior: Finds a solution.
Senior: Finds the simplest solution.

Junior: Finds a solution for right now.
Senior: Considers the longer term implications when finding a solution.

Junior: Defaults to adding complexity/code with every requirements change.
Senior: Knows that changes often indicate a unifying concept and an opportunity to remove complexity/code.

Junior: Finds ways to manage complexity.
Senior: Finds ways to remove complexity (because complexity management techniques are another form of complexity!).

Junior: Predicts results.
Senior: Embraces unpredictability and manages it.

obligatory relevant xkcd

Junior: Assumes that lack of evidence of problems is evidence of lack of problems.
Senior: Is aware that unknown unknowns almost always exist.

Junior: Can explain why their grand plan is flawless.
Senior: Ships as early and often as possible and judges solutions empirically.

Junior: Believes that it’s sufficient to achieve quality at the edges.
Senior: Knows that often you have to ensure quality of internal components separately to achieve quality at the edges.

Junior: Believes that once the pieces are verified, the system is verified.
Senior: Knows there’s still work to verify the system as a whole.

Junior: Will call “done” when the happy-path works, or the code is committed, or the code passes testing, or the code is in staging.
Senior: Realizes there are so many technicalities involved in actually solving real problems that one shouldn’t call “done” until the user has used it and agrees.

Junior: Believes that once you get the software to work, it will work in perpetuity with no extra effort.
Senior: Realizes that maintenance always takes considerable time and effort.

Junior: Optimizes code that seems slow.
Senior: Measures for performance issues and optimizes only the bottleneck.

Junior: Writes code for the computer to understand.
Senior: Writes code for the computer and humans to understand.

Junior: Focuses entirely on the technical aspects of the role.
Senior: Realizes the entire complex system is a socio-technical one that also necessarily involves people. Constantly tries to also take the humans (managers, stakeholders, users, teammates, etc) into consideration.

Junior: Works tirelessly and continuously to solve a critical and time-sensitive problem.
Senior: Understands that communication during critical times is also crucial, and that “I have no update” is still a highly valued update.

Junior: Can explain why a solution is good.
Senior: Can explain why a solution is better than the others for a given set of criteria.

Junior: Is learning to be critical of software design choices.
Senior: Is always trying to understand the tradeoffs in software design choices and can see the good in imperfect solutions.

Junior: Is learning new technology, design patterns, and practices and trying to apply them.
Senior: Judges technology solely on the basis of its ability to solve actual problems and skillfully avoids cargo-cult programming.

Anyway… these are just examples and certainly don’t form an exhaustive list. I’m not at the end of learning from experience either, so I couldn’t even write an exhaustive list if I wanted to.

One more example of the meta-variety:

Junior: Is starting to realize that predictability is valuable and is learning how to achieve it.
Senior: Realizes that different predictability approaches have different costs in different scenarios and makes choices accordingly after cost/benefit analysis.

Special thanks to my friends Deepa Joshi and Sam DeCesare for input/feedback!

The 80/20 Rule, otherwise known as the Pareto Principle, has wide-reaching implications for productivity. If you can get 80% of the results you want for 20% of the effort in a given endeavour, you can pick an efficiency/thoroughness trade-off that’s right for your desired results. There’s a real opportunity to save a whole lot of effort and still rake in big results when you find a scenario that conforms to the 80/20 rule. Unfortunately, there are a bunch of scenarios in life that don’t conform to this nice 80/20 distribution, so I’d like to propose another principle that I see pretty often.

The All-or-Nothing Principle: For many scenarios in life, there are 0% of the effects until there are 100% of the causes.

So while a Pareto distribution piles most of the effect onto a small fraction of the causes, an All-or-Nothing distribution is much simpler: you’ve either got the effect or you don’t.

It applies only to systems where results are binary, and those can be found all over the place: everywhere from electronic examples like light switches or mouse-clicks, to the more biological, like being pregnant or being dead.
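To make the contrast concrete, here’s a toy sketch of result as a function of effort under the two distributions. The curves and numbers are entirely made up for illustration; nothing here is measured.

  // Toy effort-to-result curves, purely for illustration.

  // A Pareto-ish curve: roughly 80% of the result arrives by ~20% of the effort.
  const paretoResult = (effort: number): number =>
    1 - Math.pow(1 - Math.min(Math.max(effort, 0), 1), 7);

  // An all-or-nothing curve: no result at all until effort reaches 100%.
  const allOrNothingResult = (effort: number): number => (effort >= 1 ? 1 : 0);

  for (const effort of [0.2, 0.5, 0.99, 1.0]) {
    console.log(
      `effort ${effort}: pareto ~${paretoResult(effort).toFixed(2)}, ` +
        `all-or-nothing ${allOrNothingResult(effort)}`
    );
  }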

Like the Pareto Principle, the All-or-Nothing Principle can have a huge effect on productivity if you recognize it and handle it appropriately.


Here’s a simple example: Bridges are expensive to build. Building a bridge even 99% of the way across a river gets you 0% of the results. This isn’t at all the type of distribution that the Pareto Principle describes and so you can’t make the same efficiency/thoroughness trade-off with it. You simply can’t get 80% of the cars across with 20% of a bridge, so in order to extract maximum productivity from these scenarios, you have to approach the situation fundamentally differently. The important thing to remember in these scenarios (assuming you’ve identified one correctly) is that there is no difference at all in effect between 0% effort and 99% effort. If all you can give is 99% effort, you should instead give no effort at all.

Let’s talk through the bridge example a little more: If you were a bridge-building company with 99% of the resources and time to build a bridge, you would be in a much better spot in general BEFORE you started at all than you’d be at the 99% mark, even though the 99% mark is so much closer to being complete. In fact, the almost-complete state is a sort of worst-case scenario where you’ve spent the most money/time possible and reaped the least amount of benefit. Where 80/20 distributions reap quite a lot of rewards for very little effort upfront, all-or-nothing scenarios reap no reward whatsoever until effort meets a specific criteria.

There are lots of examples of this in the real world too. Medical triage sometimes considers this trade-off, especially when resources are constrained, like in wars or other scenarios with massive casualties. Medical professionals will prioritize spending their time not necessarily on the people closest to death, but on the people close to death who have a reasonable chance of being saved. Any particular person’s survival is an all-or-nothing proposition. It’s a brutal trade-off, but the medical professionals know that if they don’t think they can save a person, they should instead spend their efforts on the other urgent cases that have a better chance; the goal, after all, is to maximize lives saved. In the real world, probabilities about outcomes are never certain either, so I’m sure this makes these trade-offs even more harrowing. (Interestingly, if a triage system like this allows 80% of the affected population to survive by prioritizing 20% of them, the effort on the population as a whole conforms to the Pareto Principle.)

The All-or-Nothing Principle and Software Development

Fortunately for me, I work in software development, where being inefficient (at least in the product areas I work in) doesn’t cost anyone their life. All-or-nothing scenarios do show up in dozens of important parts of software development though.

Of all the scenarios, I don’t think any has shaped the practices and processes of software development in the last 25 years as much as the all-or-nothing scenario of “shipping” or “delivering” product to actual users. The reason for that is that software only gets results when it’s actually put in front of users. It really doesn’t matter how much documenting, planning, development, or testing occurs if the software isn’t put in front of users.

This specific all-or-nothing scenario in software is particularly vicious because of gnarly problems on both sides of the “distribution”:

  1. Expectations of the effort required to get to a shippable state can be way off. You never really know until you actually ship.
  2. Expectations of the effectiveness of the software to meet our goals (customer value, learning, etc) can be way off. Will users like it? Will it work?

Where more linear effort-result distributions (including even 80/20 distributions) give you results and learning along the way, and a smoother feedback cycle on your tactics, all-or-nothing scenarios offer no such comforts. It’s not until the software crosses the threshold into the production environment that you truly know your efforts were not in vain. With cost and expected value both being so difficult to determine in software development, the prospects become that much more difficult and risky. These are the reasons that a core principle of the Agile Manifesto is that “Working software is the primary measure of progress.”

It should be no surprise then that one of the best solutions that we’ve come up with is to try to get clarity on both the cost and expected value as early as possible by shipping as early as possible. Ultimately shipping as early as possible requires shipping as little as possible, and when done repeatedly this becomes Continuous Delivery.

Pushing to cross the release threshold often leads to massive waste reduction: users get value delivered to them earlier, and the product development team gets learnings earlier that in turn lead to better, more valuable solutions.

All of this of course depends on how inexpensive releasing is. In the above scenario, we’re assuming it’s free, and it can be virtually free in SaaS environments with comprehensive automated test suites. In other scenarios, the cost of release weighs more heavily. For example, mobile app users probably don’t want to download updates more often than every 2-4 weeks. Even in those cases, you probably want a proxy for production release, like beta users that are comfortable with a faster release schedule.

We often find ways to release software to smaller segments of users (beta groups, feature toggles, etc.) so we can both reduce some of the upfront effort required (maybe the software can’t yet handle full production load, isn’t internationalized for all locales, or doesn’t have full brand-compliant polish) and start seeing some value immediately, which lets us extrapolate what kind of value to ultimately expect. The criterion of releasing to even one user has a surprising ability to force most of the unknown unknowns into the open that must be confronted before releasing to all users, so it’s a super-valuable way of converging on the actual cost of delivering to all users. I’ve personally found these practices to be absolutely critical to setting proper expectations about the cost/value of a software project and delivering reliably.
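To make the feature-toggle flavour of this concrete, here’s a minimal sketch of a rollout gate. Every name in it (User, ROLLOUT_PERCENT, isEnabledFor, renderCheckout) is hypothetical rather than any particular feature-flagging library’s API:

  // Hypothetical feature-toggle sketch: beta opt-ins plus a small percentage rollout.
  type User = { id: string; betaOptIn: boolean };

  const ROLLOUT_PERCENT = 5; // start tiny; widen as confidence grows

  // Crude deterministic bucketing so a given user always gets the same answer.
  function hashToPercent(id: string): number {
    let hash = 0;
    for (const char of id) {
      hash = (hash * 31 + char.charCodeAt(0)) % 100;
    }
    return hash;
  }

  function isEnabledFor(user: User): boolean {
    return user.betaOptIn || hashToPercent(user.id) < ROLLOUT_PERCENT;
  }

  // Gate the new code path; the old one stays the default until the rollout completes.
  function renderCheckout(user: User): string {
    return isEnabledFor(user) ? "new-checkout" : "old-checkout";
  }

The important property is that the old path keeps working while the new one is exposed to one user, then a few, then everyone.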

Furthermore, there is a lot more to learning early than just the knowledge gained. Learning early means you have the opportunity to actually change what you expected to deliver based on things you learned. It means that the final result of all this effort is actually closer to the ideal solution than the one that loaded more features into the release without learning along the way.

Navigating All-or-Nothing Scenarios

Again, the benefit of identifying All-or-Nothing scenarios is that there are particular ways of dealing with them that are often more effective than solutions that would be common with other types of scenarios (like 80/20 scenarios). I’ll try to dig into a few in more detail now.

Try to make them less all-or-nothing

We’ve seen from above how continuous delivery breaks up one all-or-nothing scenario into many, thereby reducing the risk and improving the flow of value.

More theoretically, it’s best to try to change unforgiving all-or-nothing scenarios into more linear scenarios where possible. That completely theoretical bridge builder could consider building single-lane bridges, or cheaper, less heavy-duty bridges that restrict heavier traffic, or even entirely different river-crossing mechanisms. These are just examples for the sake of example though… it’s probably pretty obvious by now that I know nothing about building bridges!

Software is much more malleable and forgiving than steel and concrete, so we software developers get a much better opportunity to start with lighter (but shippable) initial passes and to iterate toward the better solution later. Continuous delivery, the avoidance of Big-Bang releases, is an example of this.

Minimize everything except quality

In all-or-nothing scenarios, reducing the scope of the effort is the easiest way to minimize the risk and the ever-increasing waste. It can be tempting sometimes to consider also reducing quality efforts as part of reducing the scope, but I’ve personally found that reducing the quality unilaterally often results in not actually releasing the software as intended. The result is that you leave a lot of easy value on the table, don’t capture the proper learnings early, and set yourself up to be interrupted by your own defects later on as they ultimately need to be addressed. If you don’t really care about the quality of some detail of a feature, it’s almost always better to just cut that detail out.

Realize that time magnifies waste

For the types of work that need to be complete before the core value of an endeavour can be realized, there’s not only a real cost to how long that endeavour remains incomplete, but the more it becomes complete without actually being completed, the more it costs. In classical Lean parlance, this is called “inventory”. Like a retail shoe store holding a large unsold shoe inventory in a back storeroom (it turns out that selling a shoe is generally an all-or-nothing scenario too!), effort in an all-or-nothing scenario has costs of delay. A shoe store’s costs of delay include paying rent for the storeroom and risking buying shoes it can’t sell. It’s for this reason that Toyota employs a method called “Just-in-time” production. It’s the cornerstone of Lean manufacturing (which has coincidentally also had a huge influence on software development!).

In the case of software, larger releases mean more effort is unrealized for longer. This results in less value for the user over time and less learning for the company. If you can imagine a company that only releases once a week, high-value features (or bug fixes) that are completed on Monday systematically wait 4 days before users can get value from them. Any feature worth building has a real cost of delay.
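As a back-of-the-envelope illustration, here’s roughly what that waiting costs. The numbers (a flat value per day, a Friday-only release, features finishing uniformly through the week) are assumed purely for the sake of the arithmetic:

  // Assumed, illustrative numbers only: a feature worth $1,000/day once it's live,
  // releases happening only on Friday, features finishing uniformly Monday-Friday.
  const VALUE_PER_DAY = 1_000;
  const daysWaiting = [4, 3, 2, 1, 0]; // finished Mon..Fri, shipped Fri

  const averageWait =
    daysWaiting.reduce((sum, days) => sum + days, 0) / daysWaiting.length;

  console.log(`average delay: ${averageWait} days`); // 2 days
  console.log(`average value left waiting per feature: $${averageWait * VALUE_PER_DAY}`);
  // With continuous delivery, the wait (and the value left sitting) approaches zero.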

Similarly, more parallel work in progress at once means there’s multiple building inventories of unfinished work. If a team can instead collaborate to reduce the calendar time of a single most-valuable piece of work, this waste is greatly reduced.

Stay on the critical path

In all-or-nothing scenarios, it’s crucial to be well aware of the critical path: the sequence of things necessary to get you to the “all” side of an all-or-nothing scenario. These are the things that determine calendar time, and therefore how quickly you learn from users and deliver value. In software engineering, this is often called the steel thread solution. If you find yourself prioritizing non-critical-path work over critical-path work (which is common if you’re not ever-vigilant), you’d do well to stop and get yourself back on track.
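For illustration, here’s a toy critical-path calculation over a hypothetical task graph. The task names and durations are made up; the point is that the longest chain of dependent work sets the minimum calendar time to reach “all”:

  // Hypothetical tasks and durations, purely for illustration.
  type Task = { days: number; dependsOn: string[] };

  const tasks: Record<string, Task> = {
    schema: { days: 2, dependsOn: [] },
    api: { days: 3, dependsOn: ["schema"] },
    ui: { days: 4, dependsOn: ["api"] },
    docs: { days: 1, dependsOn: ["api"] }, // off the critical path
  };

  // A task's earliest finish is its duration plus its latest-finishing dependency.
  const memo = new Map<string, number>();
  function earliestFinish(name: string): number {
    if (!memo.has(name)) {
      const { days, dependsOn } = tasks[name];
      memo.set(name, days + Math.max(0, ...dependsOn.map(earliestFinish)));
    }
    return memo.get(name)!;
  }

  const calendarTime = Math.max(...Object.keys(tasks).map(earliestFinish));
  console.log(`critical path length: ${calendarTime} days`); // schema -> api -> ui = 9

Anything off that chain (like docs above) can flex without moving the ship date; anything on it cannot.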

Pick your battles!

If you’re faced with an all-or-nothing scenario, and you know or come to realize that you can’t make the 100% effort required, the most productive thing you can do is to not even try. Like the 99% complete bridge, your efforts will just be pure waste.

Sometimes this can be extremely hard. It’s why the term “death march” exists for software projects and project management in general. People just have a really hard time admitting that a project cannot succeed, even when they know it intellectually. It can be particularly difficult when the project has already had effort spent on it, due to the sunk cost fallacy, but that doesn’t change the proper course of action.

If you find yourself in a truly unwinnable all-or-nothing scenario, it’s best to forgive yourself for not pursuing that project anymore and to put your efforts/resources somewhere with better prospects. It’s not quitting — it’s focusing.