The Reliable Org

My software engineering background colors my management style in many ways.

Here’s an example that’s a cornerstone of my definition of good management:

Most people in software know that a “200 OK” is a web server response saying the request succeeded, and that a “400 Bad Request” error means something went wrong. Less common is the “202 Accepted” response, but it’s worth explaining: a 202 Accepted response means “I got your request, thanks. I’ll work on that.”

An endpoint that uses 202 is one I would call “unreliable”, and, all else being equal, that’s a system property you generally want to avoid -- just like it sounds. An endpoint like that gives you no idea whether your request will ultimately work, only that the system accepted it. With an endpoint that returns 200 OK or an error like 404 Not Found, though, you find out, and you find out right away.
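
To make the contrast concrete, here’s a minimal sketch in Python using Flask, with a hypothetical /orders resource invented purely for illustration. The first endpoint answers with a 200 or a 404, so the caller learns the outcome immediately; the second only ever answers 202 Accepted, so the caller learns nothing about whether the work will ultimately succeed.

```python
from flask import Flask, jsonify

app = Flask(__name__)

ORDERS = {"42": {"item": "keyboard", "status": "shipped"}}  # toy in-memory data
REFUND_QUEUE = []                                           # work we only promise to do

# A "reliable" endpoint: the response tells you whether the request worked.
@app.get("/orders/<order_id>")
def get_order(order_id):
    order = ORDERS.get(order_id)
    if order is None:
        return jsonify(error="order not found"), 404  # failure is communicated right away
    return jsonify(order), 200                        # and so is success

# An "unreliable" endpoint: 202 only means "I got your request, thanks. I'll work on that."
@app.post("/orders/<order_id>/refund")
def request_refund(order_id):
    REFUND_QUEUE.append(order_id)  # whether the refund ever happens is unknown to the caller
    return jsonify(status="accepted"), 202
```

Nothing in the 202 response says the refund failed, but nothing says it succeeded either, and that gap is exactly what makes the endpoint unreliable in the sense above.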

When people talk about whether an endpoint, or any system or component, is “reliable”, they usually take it to mean that it’s really good, or even perfect, at doing its job.

“Reliability” from a distributed computing perspective means that the system has a way to communicate failure, not that it works really well. An endpoint that returns 202 responses would be considered unreliable even if it had a 0% failure rate over a decade. This is a critical difference, because when a component can communicate the problems it encounters, you can do things like fix the component, retry the component, or switch to a different, similar component. That is to say, you can manage the component.
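
Here’s what “managing the component” can look like in code, as a hedged sketch using the requests library with hypothetical primary and fallback URLs: because each call reports success or failure, the caller can retry and then switch to a similar component, options a 202-style endpoint would never give it.

```python
import requests

def fetch_order(primary_url: str, fallback_url: str, retries: int = 3) -> dict:
    """Manage a component that communicates failure: retry it a few times,
    then switch to a different, similar component."""
    for _ in range(retries):
        try:
            resp = requests.get(primary_url, timeout=5)
            if resp.ok:  # 2xx: the request definitively succeeded
                return resp.json()
            # Any other status code is the component telling us it failed,
            # which is precisely what lets us decide to try again.
        except requests.RequestException:
            pass  # connection errors and timeouts are failures we can react to, too

    # The primary kept reporting failure -- switch to a similar component instead.
    resp = requests.get(fallback_url, timeout=5)
    resp.raise_for_status()  # surface any failure rather than hiding it
    return resp.json()
```

The specific policy here (how many retries, when to fall back) matters less than the fact that a policy is possible at all, and it’s only possible because the component reports its failures.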

Less experienced engineers often think this definition of reliability doesn’t really matter as long as the component is perfect at its job all the time. But perfection is impossible, and a more robust system comes from accepting that: build a system of components that communicate when they fail, so that the inevitable failures can be properly managed when they occur.

So what does this have to do with people management? Well, people, teams, divisions, and business units are all components of the larger system that is the business. Just as reliability is critical in distributed systems, it plays an equally important role in managing teams and organizations. What does reliability, by this definition, mean for a system of people? It means that people communicate the same way a reliable component does: they can and do tell you when things are not going well.

Reliability can usually just be built into a software system, but humans bring an emotional dimension to the system, and you can’t get reliability from them that easily. You have to consider multiple factors to make a system of people reliable. I’d like to talk now about the hardest one to get right, and the one most organizations lack: psychological safety.

Psychological safety is when people are free to voice opinions, disagree, take risks (and fail), and make mistakes without fear of punishment or other negative consequences. An organization whose people can freely say when things have gone wrong is reliable in the same sense that a technical component that communicates failure is reliable.

Most companies don’t care about psychological safety. Here’s what they do instead: they blame.

I’ll give you a real-world example. A friend of mine is a head nurse at a prestigious hospital. She discovered that another nurse had accidentally given a patient an incorrect and dangerous medication, and that nurse had no idea whether she’d be fired or reprimanded. Of course my friend corrected the situation immediately, but things could easily have gone a different way. On further investigation, it turned out the dangerous medication was stored immediately next to the correct one and carried the same colored labels. Management can’t reduce incidents like this if they don’t know about them, don’t know the details of their causes, and can’t think beyond blaming the person who gave the wrong medication.

Blame makes people hide things. People hiding things makes them unreliable.

Now you can probably think of a million other examples of this in real life. Have you ever heard an aviation disaster investigation conclude “human error”? That’s another glaring example of a complete lack of psychological safety. If the investigation stops at “human error” while we still rely on humans in the airport, in air traffic control, and on the plane, then I don’t know about you, but I’m pretty terrified that exactly the same thing will happen again. If instead the investigators assume the human did the best they could in the situation, and look for another way to improve our chances next time, then they’re actually on a path of learning and improvement.

You have to choose between the ability to blame people and the ability to learn and improve. You can’t have both.

For this reason, an organization that has psychological safety is an organization that is reliable: one where people can and do tell you when things are going wrong, where mistakes surface instead of staying hidden, and where failures turn into learning instead of blame.

Once you see it, it’s hard to consider any other type of management as management at all.



Tags: management, psychological safety, reliability
