Tuesday, February 28, 2012

James Hamilton studies some failures

Regular readers of my blog will know that I'm an enthusiastic proponent of the study of failure. When something unexpectedly goes wrong, there is always something to learn.

Thankfully, James Hamilton is a big supporter of that point of view as well, and happens to have written several wonderful essays over the past few weeks on the topic.

Firstly, Hamilton wrote about the Costa Concordia grounding, and then followed that up with a second essay responding to some of the feedback he got. This is obviously still an active investigation and we are continuing to learn a lot from it. Hamilton's essay includes some wonderful charts that illustrate the accident and speculate about some of what was occurring, together with a massive amount of supporting information on what is currently known.

My favorite part of Hamilton's essay, though, is his conclusion:

What I take away from the data points presented here is that experience, ironically, can be our biggest enemy. As we get increasingly proficient at a task, we often stop paying as much attention. And, with less dedicated focus on a task, over time, we run the risk of a crucial mistake that we probably wouldn’t have made when we were effectively less experienced and perhaps less skilled. There is danger in becoming comfortable.

Very true, and very important words. Not to reduce it to the overly mundane, but I recently got a traffic ticket for rolling through a stop sign, which gave me my opportunity for that once-a-decade visit to Traffic School. Although the fines and wasted time were an annoyance, it was clear by the end of Traffic School that my 35 years of driving experience had in fact become something of an enemy; there were many specific details about how to drive safely and legally that I was no longer paying attention to, and the course materials brought them back to the front of my mind.

There is, indeed, danger in becoming comfortable.

Secondly, Hamilton wrote about another fascinating incident: the loss of the Russian space mission Phobos-Grunt.

As Hamilton notes, there is a very interesting report on this incident in IEEE Spectrum magazine: Did Bad Memory Chips Down Russia’s Mars Probe?

But, as Hamilton observes, although the analysis of memory chips and radiation effects and system faults is fascinating and valuable, there is a further, deeper sort of failure:

Upon double failure of the flight control systems, the spacecraft autonomously goes into “safe mode” where the vehicle attempts to stay stable in low-earth orbit and orients its solar cells towards the sun so that it continues to have sufficient power.

...

Unfortunately there was still one more failure, this one a design fault. When the spacecraft goes into safe mode, it is incapable of communicating with earth stations, probably due to spacecraft orientation. Essentially if the system needs to go into safe mode while it is still in earth orbit, the mission is lost because ground control will never be able to command it out of safe mode.

...

Systems sufficiently complex enough to require deep vertical technical specialization risk complexity blindness. Each vertical team knows their component well but nobody understands the interactions of all the components.

Kudos to Hamilton for the well-researched and thoughtful observations, and for providing all the great pointers for those of us who, like him, love studying failures and their causes.

What failure will we be studying next? Well, it sure looks like there's a lot to learn from this one: The Air Force Still Doesn’t Know What’s Choking Its Stealth Fighter Pilots.

America’s newest stealth fighters have a major problem: their pilots can’t breathe, due to some sort of malfunction in the planes’ oxygen-generation systems. For months, the Air Force has been studying the problem, which temporarily grounded the entire fleet of F-22 Raptors and may have contributed to a pilot’s death. Today, the Air Force admitted they still don’t know exactly what’s causing the issue.

It looks like this question has been under study for several years, and may still take some time to resolve. The Wired article has a number of pointers to previous articles about the problem. I'll keep an eye on this one, eager to learn from the detailed analysis of the failures.
