One lesson for product managers, project managers, and IT solution architects is already clear from the current nuclear reactor overheating crisis at Fukushima in Japan. When all of your failover systems can fail simultaneously due to the same cause, you don’t have a redundant failover system. You have a single point of failure with multiple moving parts.
According to CNN’s television coverage, the nuclear reactors at Fukushima shut down as designed when the earthquake hit. However, even after you shut down a nuclear reactor of that design, you have to keep actively cooling it for a long time to avoid a meltdown due to the residual heat in the core. These reactors had several available backup sources of power to run the active cooling systems: the national electrical grid, thirteen backup diesel generators, and backup batteries that last for eight hours.
Unfortunately, the combination of the earthquake and tsunami cut off power from the electrical grid and caused all thirteen backup generators to fail about an hour after the earthquake, apparently because they were hit at that time by the tsunami. This left the nuclear reactors with only eight hours of power available to them from the batteries, which isn’t enough time for the reactors to cool down to a safe temperature. Now there’s a “strong possibility” that reactors one and three have experienced partial meltdowns, and the utility has resorted to flooding some portion of those reactors with raw seawater and boron in a last-ditch effort to prevent the situation from getting worse. This morning, reactor number two temporarily had its cooling pump fail and had its fuel rods fully exposed for thirty minutes, meaning it may have suffered a partial meltdown as well.
We need to be fair to Japan and Tokyo Electric Power. Currently available information is fragmentary and preliminary. This earthquake was one of the largest in recorded history, was the largest in Japan’s recorded history, and was beyond the magnitude that designers were required at the time to plan for. Anytime a magnitude 8.9 earthquake occurs near major population centers, it will cause enormous damage. Japan tries harder than any other country to prepare well for earthquakes and tsunamis. If the same size of earthquake and tsunami occurred at the same distance from a United States nuclear reactor along the coastline, the results would probably be even worse. And there are always some high-impact, low-probability risks, such as an asteroid hitting the middle of the Pacific and generating a 1,000-foot-high tsunami, that it’s not cost-effective for a single installation’s architects to design against.
With that said, anyone who decides to build a nuclear reactor is responsible for ensuring that it will not have a significant chance of a meltdown due to reasonably foreseeable risks that have a significant possibility of occurring during its operational lifetime. It looks like there was a serious failure in the risk analysis, design, and approval process on this point. The designers and regulators knew that Japan is located on multiple fault lines and is subject to major earthquakes. They also knew that earthquakes can cause a tsunami. The designers nonetheless chose to build a nuclear reactor directly on the coastline and to site all of their backup generators in a location that could be swamped by a tsunami about 20 feet high, and regulators approved this plan. It appears that no meaningful changes were made after the 2004 Indonesian earthquake generated tsunami waves up to 80 feet high, which should have been a wake-up call to operators of coast-side nuclear reactors worldwide that earthquakes can generate giant waves. Now we’re seeing the results.
After the reactor crisis has played itself out to its ultimate resolution, there will doubtless be an investigation. I suspect that at some point, a risk analysis document will be found in which the designers estimate the probability of all thirteen backup generators failing simultaneously. This risk analysis will probably be written in one of two ways.
- The risk analysis may calculate the risk of each backup generator failing and then estimate the risk of all of them failing simultaneously by multiplying each generator’s risk of failure together, concluding that the risk of them all failing simultaneously is statistically very, very low. However, such an analysis assumes that the backup generators are all independent systems. As this crisis has demonstrated, the backup generators were NOT independent of each other. Because they were all in the same coast-side, sea level location, they all shared the common vulnerability of being shut down simultaneously by the same tsunami. Therefore, the actual risk of them all failing simultaneously due to a tsunami was equal to the risk of a single one of them failing due to a tsunami. Since all thirteen backup generators in actual fact failed when hit by this tsunami, the risk that each backup generator would fail when hit by a tsunami of this size appears to have been 100%. Effectively, the backup generators as a group were a single point of failure, and constructing the backup diesel generators in this way was implicitly a bet that a tsunami of this size would not occur during its operational lifetime, a bet that Tokyo Electric Power and Japan just lost.
- Alternatively, the risk analysis may include a caveat that an earthquake and tsunami of sufficient size could shut down the reactor, cut off the electrical grid, and swamp all the backup generators. If so, the risk analysis probably either states that such an event is statistically unlikely to occur during the reactor’s operational lifetime or that one or more backup generators could be brought back online within the eight-hour lifetime of the backup batteries. Both assumptions have been proven wrong by what actually occurred in this incident.
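To make the difference between the two styles of analysis concrete, here is a minimal Python sketch. Every number in it is a made-up placeholder, not a figure from any actual risk analysis of the plant, and the function names are hypothetical. It assumes a simple model in which a common-cause event (such as a large enough tsunami) disables every generator at once, and in which generators otherwise fail independently of each other.

```python
def p_all_fail_independent(p_single: float, n: int) -> float:
    """Chance that all n generators fail, assuming each one fails
    independently with probability p_single (the naive multiplication)."""
    return p_single ** n


def p_all_fail_with_common_cause(p_single: float, n: int, p_common: float) -> float:
    """Chance that all n generators fail when a shared event with
    probability p_common takes out every generator at once.

    Assumed model: if the shared event happens, all generators fail with
    certainty; if it does not, they fail independently as above.
    """
    return p_common + (1.0 - p_common) * p_single ** n


if __name__ == "__main__":
    # Placeholder values chosen only for illustration.
    P_SINGLE = 0.05   # hypothetical chance that one generator fails on demand
    N = 13            # number of backup generators
    P_COMMON = 0.01   # hypothetical chance of a site-wide flooding event

    naive = p_all_fail_independent(P_SINGLE, N)
    realistic = p_all_fail_with_common_cause(P_SINGLE, N, P_COMMON)

    print(f"independent assumption: {naive:.3g}")      # on the order of 1e-17
    print(f"with common cause:      {realistic:.3g}")  # roughly P_COMMON
```

Under these assumptions, adding more generators at the same location barely changes the second number: it can never drop below the probability of the shared event itself.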
It is possible that if the designers had placed two or three of the backup diesel generators at a higher elevation with secure, flexible, earthquake-resistant underground power cable connections to the reactors, we wouldn’t be having any problems with these reactors right now. Instead, we have to hope that the last-ditch cooling techniques of flooding the reactors with seawater (to cool them) and boron (to absorb neutrons) and the strength of the containment vessels will together prevent uncontrolled meltdown events and the release of massive amounts of radiation. If all else fails, the containment systems should probably prevent a catastrophic release of radiation into the environment. Let’s pray that’s the case.
The lessons from Fukushima are not limited to the operators of coast-side nuclear reactors. As a product manager, project manager, or IT solution architect, learn from the crisis at Fukushima and ask yourselves these questions:
- Are my failover systems independent of each other, or do they share one or more single points of potential failure?
- What is the statistical probability that one of those single points of failure will in fact fail?
- What are the consequences in the worst-case scenario and in other, lesser scenarios if one of those failures occurs?
- Have I invested in failover and backup systems to the point that further investment is not cost-effective and that reasonably foreseeable risks will only cause acceptable damages and costs?
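The first question in the list above can be approached mechanically once you have written down what each failover system depends on. The sketch below is one possible way to do that in Python. The system and dependency names are hypothetical examples from an imagined IT deployment, and the helper functions are illustrative rather than part of any real tool.

```python
from collections import defaultdict
from typing import Dict, List, Set


def shared_dependencies(systems: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each dependency used by more than one system to those systems."""
    users = defaultdict(list)
    for name, deps in systems.items():
        for dep in deps:
            users[dep].append(name)
    return {dep: sorted(names) for dep, names in users.items() if len(names) > 1}


def single_points_of_failure(systems: Dict[str, Set[str]]) -> Set[str]:
    """Return the dependencies that every listed system relies on."""
    if not systems:
        return set()
    return set.intersection(*(set(deps) for deps in systems.values()))


if __name__ == "__main__":
    # Hypothetical inventory: each failover system and what it needs to run.
    inventory = {
        "primary_db": {"rack_a", "pdu_1", "region_east"},
        "replica_db": {"rack_b", "pdu_1", "region_east"},
        "cold_backup": {"rack_c", "pdu_2", "region_east"},
    }

    # Dependencies shared by at least two systems deserve a closer look.
    for dep, names in sorted(shared_dependencies(inventory).items()):
        print(f"{dep} is shared by: {', '.join(names)}")

    # Anything shared by all systems defeats the redundancy entirely.
    print("single points of failure:", sorted(single_points_of_failure(inventory)))
```

A listing like this is only as good as the inventory behind it. Physical location, power distribution, and shared fuel or network paths are easy to leave out, and those are exactly the kinds of dependencies that mattered at Fukushima.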
Don’t just point fingers at Tokyo Electric Power or the government of Japan. Look in the mirror and ask yourself if your own product or project is next in line for a catastrophe, and then ask yourself what you can do to reduce significant risks!
I’ve been trying to track down the root cause here – the reports I’ve read said the tsunami hit land 8 minutes after the quake – and I’m trying to reconcile the 8 minutes with the one hour… perhaps the plant was further away from the epicenter.
My current theory is the PDU was flooded. I was involved with an outage of a computer system that could never fail – where the PDU had been allowed to be a single point of failure in the design – since it has nothing that can break – until it does.
Hi Fred – Here are things that I have read in the New York Times, Wall Street Journal, and heard on KCBS radio and CNN today and in previous days:
1) As designed, the reactors scrammed (shut down by fully applying the control rods) when the earthquake hit.
2) As designed, the backup power generators started automatically when the reactors scrammed, and they were successfully generating power and cooling the reactors until the tsunami hit, at which point they all immediately failed and the facility switched over to the backup batteries with an 8-hour lifespan. Attempts were made to deliver replacement batteries to the facility.
3) The 13 backup power generators were all placed underground and may have been flooded by the tsunami.
4) The fuel tanks for the backup power generators were aboveground and may have been swept away by the force of the tsunami.
5) The length of time between the earthquake and landfall of the tsunami varied by distance between the earthquake’s epicenter and the point on the coast in question. (Obviously. Tsunami waves start at and around the location of the earthquake and radiate outward.) I have heard that there was about a one hour delay between the time of the earthquake and the time the tsunami hit the reactor site.
6) An attempt was made to install a new backup generator at the reactor site, but it was unsuccessful because the power couplings on the new backup generator and the reactor did not match.