Will your company, product, or project have a major disaster like the BP gulf oil spill, the PG&E gas pipeline explosion, or the Heartland Payment Systems credit card security breach? Have you made the right amount of effort and spent the right amount of money to identify and prevent events that have a low probability of occurring but a high impact if they happen? And how do you know what’s the right amount?
Ways Product and Project Managers Can Reduce the Risk of Catastrophic Failures
Product, project, and operational management always includes an element of risk management. Most software product and project managers are fortunately responsible for products and projects that won’t cause death or injury if they fail. No one is seriously harmed if Twitter goes down. But we still face a variety of risks that have a low probability of occurring but a high impact when they occur. What’s the risk that our product could be hacked and compromise private customer information? What’s the risk that our data center could fail and lead to a service outage? Product and project managers can do many things to reduce the risk that our own products, projects, and customers will experience a failure with catastrophic impact.
- Study history. Look at the history of your own and other companies and any problems they’ve had. Learn from past mistakes and those of others and avoid repeating them.
- Study how to estimate risks accurately. Study basic statistics, statistical significance, and the difference between correlation and causation. Read books like Everyday Irrationality by Robyn Dawes that will teach you to recognize and avoid common mental errors in risk analysis.
- Be proactive, not reactive. Don’t wait for disaster to happen. Anticipate the possibility and aggressively get out in front of it now.
- Identify possible risks. You’re unlikely to prevent a risk you haven’t thought of. Work with your team to brainstorm a list of possible risks with an open mind. Circulate the list of risks internally and get feedback. Consider hiring an outside analyst where appropriate.
- Estimate the probability and likely impact of each risk. For example, assume that any possible security exploits in a product deployed on the Internet will be found by black hat hackers and fully exploited. Determine the impact of such exploits.
- Determine what you can do to reduce each risk. Sometimes, whole classes of risks can be inexpensively eliminated by designing a product well up front. At Kontiki, we built encryption into our content delivery system to prevent hostile interception of file transfers.
- Estimate the cost of each risk reduction measure. If you decide to go ahead and implement a measure because it’s obviously easy or clearly necessary, you can skip this step. If people are opposing the measure, you will need to estimate its cost to help cost-justify it.
- Ensure that all cost-effective risk-reduction measures are taken. Knowledge alone is not enough. Action is necessary. A Caltrans report documented well in advance of the Loma Prieta earthquake that the Oakland double-decker highway was built on fill and would likely collapse in a major earthquake. Nothing was done, and people died needlessly as a result.
- Ensure that estimates are objective, accurate, and not “fudged” to make the numbers look better. A risk management system is only as good as it is honest and accurate. Managers may have a financial incentive to fudge the numbers to justify skipping expensive risk management steps and make short-term financial results look better. Resist and escalate if you are encouraged to bias estimates.
- Document policies that will reduce risks. Tribal knowledge and oral communication aren’t enough. Risk reduction practices must be put in writing to enable accurate communication, training, and compliance. The Information Technology Infrastructure Library (ITIL®) provides valuable resources and training to assist in developing quality documentation.
- Train employees to ensure they understand policies and procedures. Make sure that your engineers know best practices in software development and that your operations staff know how to secure systems.
- Monitor compliance. Make sure QA adequately tests your product to verify it works as designed, including features for risk reduction. Make sure operations follows documented procedures to ensure that vulnerabilities aren’t introduced. Track operational metrics that indicate whether you’re meeting your Service Level Agreements.
- Promptly respond to and fix overlooked problems when they’re discovered. Perhaps a security review discovers a buffer overflow exploit that engineering and QA overlooked. Fix it and release an updated version of the product to close the security hole. It’s been reported that PG&E received multiple reports of gas smells before the San Bruno explosion. Don’t you think PG&E wishes they’d investigated those reports more quickly and more thoroughly and proactively investigated the possibility that the nearby major gas line was developing a breach?
- Don’t let cost-benefit testing become an excuse for inaction. When risk reduction measures are proposed, people will sometimes use the fact that perfect risk elimination is usually impossible as an excuse for avoiding risk reduction measures that are in fact necessary, reasonable, and cost-effective. Make sure that awareness of cost-benefit analysis isn’t misused as an excuse for doing nothing.
- Challenge denial. When faced with a reality they don’t like such as the need for expensive risk reduction measures, people will sometimes retreat into the psychological defense of denial. They will argue without evidence (or in the face of contradictory evidence) that a risk isn’t real, won’t occur, won’t have serious consequences, or isn’t worth addressing for other reasons. Don’t back down in the face of illogical statements. Use data to challenge and overcome denial.
- Prevent a culture of complacency. Companies sometimes develop a dysfunctional culture in which violation of risk management rules is taken for granted and employees begin to consider it normal. Prevent this. Take risk, safety, and compliance seriously and flag all violations for resolution.
- Do more than you are doing already. People tend to underestimate the effect of risks that happen rarely but are high in impact when they occur. Managers are often rewarded for short-term cost cutting but rarely punished for disasters that haven’t happened yet and may never happen. As a result, companies have a chronic incentive to underinvest in risk prevention. You’re probably not doing as much as you should. Do more.
- Have some employees who are responsible for and compensated based on successful risk management and legal compliance. If everyone in your company is rewarded for short-term company profitability and no one is rewarded for monitoring risk management and legal compliance, you’ll create a structural incentive for inappropriate risk taking.
- Escalate through management channels if inappropriate risks are being taken or policies are being violated. Work within your company to solve problems you discover. Well-managed companies will respond to the flagging of issues and reward those who report them.
- If your company can’t build or operate a product safely, propose exiting that line of business and recalling the product where necessary. Keep in mind the possibility that perhaps your company shouldn’t be offering a particular product at all.
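The estimation steps in the list above (probability, impact, and the cost of each risk reduction measure) can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `Risk` class, its field names, and every number below are hypothetical, and real probability estimates for rare events carry wide uncertainty.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One row of a hypothetical risk register. All values are estimates."""
    name: str
    annual_probability: float     # estimated chance of the event in a given year
    impact: float                 # estimated loss in dollars if the event occurs
    mitigation_cost: float        # estimated annual cost of the mitigation
    mitigated_probability: float  # estimated annual probability after mitigation

    def expected_loss(self) -> float:
        # Expected annual loss with no mitigation in place.
        return self.annual_probability * self.impact

    def expected_savings(self) -> float:
        # Reduction in expected annual loss that the mitigation buys.
        return (self.annual_probability - self.mitigated_probability) * self.impact

    def is_cost_justified(self) -> bool:
        # True when the expected savings exceed the cost of the mitigation.
        return self.expected_savings() > self.mitigation_cost


# Made-up example entries, for illustration only.
risks = [
    Risk("customer data breach", 0.02, 5_000_000, 40_000, 0.005),
    Risk("data center outage", 0.10, 200_000, 50_000, 0.05),
]

for r in risks:
    print(
        f"{r.name}: expected loss ${r.expected_loss():,.0f}, "
        f"expected savings ${r.expected_savings():,.0f}, "
        f"mitigation cost ${r.mitigation_cost:,.0f}, "
        f"cost-justified: {r.is_cost_justified()}"
    )
```

With these made-up inputs, the breach mitigation pays for itself in expected terms and the outage mitigation does not. Treat a result like this as a starting point for discussion rather than a verdict, since a small change in an uncertain probability can flip it.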
Cost-Benefit Testing is Unavoidable
In theory, we would like companies to do “everything possible” to reduce every known risk to the lowest level possible with current technology. In practice, whether we like it or not, the amount a company spends on risk management is limited because resources at any company and in the economy as a whole are limited. No matter how much money a company or the government has spent on reducing risk, it could always spend more to further reduce more risks. The more you reduce risk, the more expensive further risk reduction becomes.
In practice, the amount of money a company spends on risk management for a product is limited by competition, because that spending raises the total cost of its product to the customer. If its product’s total cost of ownership rises high enough, customers will choose another company’s similar product or different, substitute products because their price is lower.
Consider the problem of preventing malware hijacking of browser sessions and keylogging attacks. This is a serious problem that can cause your bank account to be drained. But should a browser vendor do “everything possible” to prevent it? A browser vendor could require that you purchase and install antivirus software on your machine before its browser will start up. It could require that your antivirus subscription be up to date and verify this each time the browser starts up. It could limit the browser to browsing a list of known web sites and block you when you visit an unknown one. It could require that you hire a trained security expert to stand behind you and watch your actions on the screen whenever you surf the Internet, and start up only when the expert types in a password to prove they are present. Obviously, the last option is cost-prohibitive. People wouldn’t use a browser that imposed all these requirements; they would choose to use a different browser.
The problem gets much harder when the risk of human injury or death is involved. If your bank account is drained by a hacker, the harm is largely economic. The risk posed by some products includes serious injury or death. Obviously, when the risk of injury or death is involved, vastly greater spending on risk management is justified morally (by the obligation to prevent harm to others), financially (by the risk of lawsuits), and legally (by statutory or regulatory requirements).
Even on products where the risk of injury or death is involved, an element of cost-benefit testing is unfortunately inevitable. Consider the backyard swimming pool. It’s a product that carries the risk of accidental injury or death by drowning. Should a pool manufacturer do “everything possible” to eliminate the risk of accidental drowning? The manufacturer could include a password-protected, hydraulically-powered wall and ceiling around its pool that would prevent unauthorized entry. It could require you to have two paid professional lifeguards present when the pool was in use and only open the wall when each lifeguard entered a separate passcode known only to them. Obviously, this would make the pool unaffordable and cause it to fail in the marketplace in competition against other pools lacking such protections. If the government were to require such protective measures for all pools, it would make backyard pools unaffordable for all but the richest people and deny others the benefits of exercise in backyard pools which may reduce the risk of other life-threatening problems like obesity and diabetes.
Cost-benefit analysis is also necessary to make sure a company is choosing the best ways to reduce risk as much as possible as quickly as possible given a particular level of spending. When resources are limited, total risk will be reduced most quickly if you invest in inexpensive measures that make a big difference first and make expensive changes with less benefit later.
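One simple way to apply this ordering is to rank measures by estimated risk reduction per dollar and fund them in that order until the budget is spent. The sketch below assumes each measure has a single cost and a single estimated reduction in expected annual loss. The `Measure` type, the `prioritize` function, and all figures are hypothetical, and this greedy ranking is a heuristic that will not always find the best possible combination.

```python
from typing import List, NamedTuple


class Measure(NamedTuple):
    name: str
    cost: float            # dollars; must be greater than zero
    risk_reduction: float  # estimated reduction in expected annual loss, dollars


def prioritize(measures: List[Measure], budget: float) -> List[Measure]:
    """Fund measures in descending order of risk reduction per dollar.

    Measures that no longer fit in the remaining budget are skipped.
    """
    ranked = sorted(measures, key=lambda m: m.risk_reduction / m.cost, reverse=True)
    chosen = []
    remaining = budget
    for m in ranked:
        if m.cost <= remaining:
            chosen.append(m)
            remaining -= m.cost
    return chosen


# Made-up candidate measures, for illustration only.
measures = [
    Measure("code review for new features", 20_000, 60_000),
    Measure("redundant data center", 500_000, 400_000),
    Measure("automated malware scanning", 10_000, 50_000),
    Measure("annual penetration test", 30_000, 45_000),
]

plan = prioritize(measures, budget=100_000)
for m in plan:
    print(f"{m.name}: {m.risk_reduction / m.cost:.1f} dollars of risk reduced per dollar spent")
```

In this example the three inexpensive measures are funded first, and the redundant data center, which costs more than the whole budget and returns less per dollar, waits for a later round.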
Any genuine attempt at risk management clearly needs to include an element of cost-benefit analysis. The real question is not “Have we done everything that could possibly be done with unlimited resources to identify and reduce risks?” The answer to that question is always “no, with unlimited resources, you could always do more.” The question is “Have we done the right amount to identify and reduce risks?” and “What’s the right amount?”
The Problem of Low-Probability, High-Impact Risks
Managing risk gets much harder when events have a low probability but a high impact when they occur. Nassim Taleb calls these Black Swan events. (Read the book if you haven’t already.)
When events happen frequently, we have ample data and can make estimates about their frequency and impact with confidence. For example, car accidents are common. Each year in the United States, 40,000 to 45,000 people die in automobile accidents. The impact of a single car accident is also usually limited to the car’s occupants and another car or nearby pedestrians. Statisticians can therefore estimate the number of deaths that would be prevented by safety measures like seat belts and air bags, estimate how much those measures would cost, and determine the cost-effectiveness of those safety measures with high confidence.
Risk assessment and management get much harder when events are rare but severe, because that kind of data does not exist. The recent BP and PG&E disasters tragically illustrate this problem.
- Reuters reports that more than 50,000 wells have been drilled in federal waters in the Gulf of Mexico since 1947. Yet when the Deepwater Horizon blowout occurred, it caused the “largest offshore oil spill in United States history.” Major oil spills have a low frequency but a high impact when they occur. (Note that current drilling activity in deeper water and higher-pressure formations may be riskier than historical drilling activity, so an accurate risk estimate on current activity is more complicated than a simple average over all historical well data, but clearly, major oil spills happen less often than car accidents and have far broader effects.)
- Most houses and apartments in America have natural gas connections. Gas pipeline explosions are fortunately infrequent, but they can be devastating when they occur in a populated area. Imagine if the recent San Bruno pipeline explosion had taken place in the middle of the night next to a twenty-story apartment building. Even more lives would have been lost.
Because the events don’t occur often, data is limited and reasonable people can disagree about the actual probability of an event. Because the events have high impact when they occur, the cost of being wrong is enormous. These two factors make accurate cost-benefit analysis extremely hard in such cases.
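A small calculation shows how little a clean track record proves. If an event has never been observed in n independent trials, the data rule out only probabilities above a certain bound, and that bound shrinks slowly as n grows. The sketch below computes the exact one-sided upper confidence bound for the zero-event case. The function name and the trial counts are hypothetical, and the calculation assumes every trial carries the same independent risk, which often does not hold in practice (for example, when newer operations are riskier than historical ones).

```python
def upper_bound_zero_events(trials: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on a per-trial probability when zero events
    were observed in `trials` independent, identical trials.

    With per-trial probability p, the chance of seeing zero events is
    (1 - p) ** trials. Setting that equal to (1 - confidence) and solving
    for p gives the bound returned here.
    """
    if trials <= 0:
        raise ValueError("trials must be a positive integer")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


# Hypothetical trial counts, for illustration only.
for n in (100, 1_000, 10_000):
    bound = upper_bound_zero_events(n)
    print(f"{n} trials with no events: per-trial probability could still be up to {bound:.5f}")
```

At 95% confidence the bound is close to 3 divided by the number of trials, a shortcut often called the rule of three. Even a spotless record over 1,000 trials is therefore consistent with a per-trial probability of roughly 0.3%, which can still be unacceptable when a single event is catastrophic.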
The Problem of Financial Conflict of Interest
For risks that are low in probability but high in impact, the perpetual financial incentive for short-term cost cutting creates an additional problem. Annual spending is usually necessary to reduce the risks of such disasters. Gas utilities need to regularly inspect their pipelines to detect corrosion before it causes a leak. Credit card processors need to scan their systems for the presence of malware. Software developers need to do code reviews to detect potential vulnerabilities.
Since business managers are usually compensated at least in part based on short-term company profitability (whether directly by bonus plans or indirectly through company stock), they have a perpetual personal financial incentive to cut costs. Project managers may get a bonus for early completion, and product managers may be rewarded for getting a product to market quickly. If managers cut spending on measures to prevent disasters that are low in probability, company profitability will immediately improve, but a disaster likely won’t occur right away. The manager may be rewarded and promoted for their good cost controls, which will encourage them and others to make still more cuts, all the while increasing the chance of a disaster occurring. This process will continue until a disaster occurs, which will finally create pressure for more appropriate risk management measures that might have prevented the disaster in the first place.
Companies need to make sure that their compensation schemes aren’t excessively based on short-term company financial performance. Restricted stock that vests over time is one possible tool. Managers should have effective risk management as one of their performance review criteria. Companies also need to make sure that there are employees in the company with specific responsibility to monitor and manage risks and compliance who will create a countervailing force against short-term profit incentives.
What Can YOU Do To Prevent a Disaster at Your Own Company?
It’s not enough just to point fingers at BP and PG&E for their specific failures. Clearly, those companies need to do more than they are already doing to reduce the risk of disasters that will cost innocent lives. As customers and stockholders, we need to hold companies and regulating agencies accountable for managing product risk. But we also need to look in the mirror and consider our own companies and products. Have you and your company done enough to identify and reduce the risks posed by your products or projects? Probably not. Take some time and investigate whether you can do better!
Conflict of interest statement: At the time this article was published, the author held a small position in BP stock and held broad-based mutual funds and ETFs that doubtlessly invest in companies mentioned in this article.