It is widely known that water boils at 100 degrees Celsius. But that’s not quite right. To boil, water needs a nucleation site, such as an impurity or a surface imperfection, for bubbles to form around. Most of our water and quite a few containers provide enough of these that this isn’t a problem in practice. But if you take distilled water and microwave it in a porcelain teacup you can get it well above the boiling point. It will look perfectly calm on the surface. But put a spoon into the water and it will explode. The spoon is the proximate cause. But the superheated water is a root cause, and the use of distilled water and a smooth vessel are root causes beyond that.
In my experience we spend a lot of time focusing on the spoon when a failure happens. Pulling the thread towards the root cause is more often the better use of time. The solution certainly isn’t to never stir hot water. It involves understanding the properties of the water itself and the container it is in. And even then, you can’t stop at the first root cause explanation you arrive at. Indeed, failure theory suggests there is no such thing as a single root cause in any meaningful accident.
When considering a failure I have to manage my natural human instinct to draw simple cause-and-effect narratives. It can be surprisingly hard to deny myself the emotional satisfaction of a quick and easy explanation. I have to work hard to delay judgement and continue to patiently gather information. When I do it right, I find myself reminded of the Zen parable of the farmer whose horse escaped. His neighbors comment on what bad luck he has had and he replies “maybe.” His horse returns with seven wild stallions and his neighbors comment on what good luck he has had and he replies “maybe.” His son attempts to break one of the stallions and breaks his leg, prompting his neighbors to once again lament his luck, to which he replies “maybe.” The next day the army comes to conscript his son to service but cannot because his leg is broken. What good luck, right? Maybe.
Facebook developed a surprisingly strong culture around this very early on. When there is a major site issue, called a SEV, everyone bands together to understand and solve the problem without seeking blame. After the issue is resolved we hold a SEV review to understand how and why it happened. The people who are responsible attend, but the goal of the meeting isn’t to scold or castigate them. It is to understand the complex series of conditions that made the error possible or even likely. We think about what carefully balanced protections can be put in place to reduce the likelihood of it happening in the future. In this way we improve our performance over time without creating an incentive against making changes.
Narratives around success are just as complicated as those around failure. Once again we are inclined to jump to simple cause-and-effect explanations. When a new product fails to gain traction it is common to hold a “post-mortem” to understand what we can learn from it. But we should do the same work after a product succeeds. There is likely just as much to learn.