On the limitations of scoring methods for risk analysis
A couple of months ago I wrote an article highlighting some of the pitfalls of using risk matrices. Risk matrices are an example of scoring methods: techniques that use ordinal scales to assess risks. In these methods, risks are ranked by predefined criteria such as impact or expected loss, and the ranking is then used as the basis for decisions on how the risks should be addressed. Scoring methods are popular because they are easy to use. However, as Douglas Hubbard points out in his critique of current risk management practices, many commonly used scoring techniques are flawed. This post – based on Hubbard’s critique and the research papers quoted therein – is a brief look at some of those flaws.
Commonly used risk scoring techniques and problems associated with them
Scoring techniques fall under two major categories:
- Weighted scores: These use several ordered scales, each weighted according to its perceived importance. For example, one might be asked to rate financial, technical and organisational risk on a scale of 1 to 5 each, and then weight them by factors of 0.6, 0.3 and 0.1 respectively (possibly because the CFO – who happens to be the project sponsor – is more concerned about financial risk than any other). The point is that the scores and weights assigned can be highly subjective – more on that below.
- Risk matrices: These rank risks along two dimensions – probability and impact – and assign them a qualitative rating of high, medium or low depending on where they fall. Cox’s analysis of risk matrices shows that such categorisations can be internally inconsistent because the category boundaries are chosen arbitrarily. (A minimal sketch of both approaches follows this list.)
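To make the two approaches concrete, here is a minimal Python sketch of both. All ratings, weights and category boundaries below are illustrative assumptions, not values taken from any standard:

```python
# A minimal sketch of the two scoring approaches described above.
# All scale values, weights and category boundaries are illustrative
# assumptions, not taken from any standard or methodology.

# -- Weighted scores: rate each risk dimension on a 1-5 ordinal scale,
#    then combine using subjective weights (the CFO-style weighting
#    from the example above).
weights = {"financial": 0.6, "technical": 0.3, "organisational": 0.1}
ratings = {"financial": 4, "technical": 2, "organisational": 3}

weighted_score = sum(weights[k] * ratings[k] for k in weights)
print(f"Weighted score: {weighted_score:.1f}")  # 3.3 on a 1-5 scale

# -- Risk matrix: map probability and impact scores to a qualitative
#    rating. The boundaries below are arbitrary, which is exactly the
#    inconsistency Cox identifies.
def matrix_rating(probability: int, impact: int) -> str:
    product = probability * impact  # both on a 1-5 scale
    if product >= 15:
        return "high"
    if product >= 6:
        return "medium"
    return "low"

print(matrix_rating(probability=3, impact=4))  # -> 'medium'
```

The outputs look precise, but every input – the 1 to 5 ratings, the weights, the high/medium/low boundaries – is a subjective choice.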
Hubbard makes the point that, although both the above methods are endorsed by many standards and methodologies (including those used in project management), they should be used with caution because they are flawed. To quote from his book:
Together these ordinal/scoring methods are the benchmark for the analysis of risks and/or decisions in at least some component of most large organizations. Thousands of people have been certified in methods based in part on computing risk scores like this. The major management consulting firms have influenced virtually all of these standards. Since what these standards all have in common is the use of various scoring schemes instead of actual quantitative risk analysis methods, I will call them collectively the “scoring methods.” And all of them, without exception, are borderline or worthless. In practice, they may make many decisions far worse than they would have been using merely unaided judgements.
What is the basis for this claim? Hubbard points to the following:
- Scoring methods make no allowance for the flawed perceptions of the analysts who assign scores – i.e. they do not account for the effects of cognitive bias. I won’t dwell on this as I have previously written about the effect of cognitive biases in project risk management – see this post and this one, for example.
- The qualitative descriptions assigned to each score are understood differently by different people. Further, there is rarely any objective guidance as to how an analyst is to distinguish between, say, a high and a medium risk. Such guidance may not even help: research by Budescu, Broomell and Por shows that there can be huge variances in how qualitative descriptions are understood, even when people are given specific guidelines as to what the descriptions or terms mean.
- Scoring methods add their own errors. Below are brief descriptions of some of these:
- In his paper on risk matrices, Cox notes that “Typical risk matrices can correctly and unambiguously compare only a small fraction (e.g., less than 10%) of randomly selected pairs of hazards. They can assign identical ratings to quantitatively very different risks.” He calls this behaviour “range compression” – and it applies to any scoring technique that uses ranges (see the sketch after this list).
- Assigned scores tend to cluster in the mid-to-high range. Analysis by Hubbard shows that, on a 5-point scale, 75% of all responses are a 3 or a 4. This implies that changing a score from 3 to 4 (or vice versa) can have a disproportionate effect on how a risk is classified.
- Scores implicitly assume that the magnitude of the quantity being assessed is directly proportional to the scale. For example, a score of 2 implies that the criterion being measured is twice as large as it would be for a score of 1. In reality, the underlying quantities rarely vary linearly in the way such a scale implies.
- Scoring techniques often presume that the factors being scored are independent of each other – i.e. there are no correlations between factors. This assumption is rarely tested or justified in any way.
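Range compression is easy to demonstrate. In the sketch below (the probability and impact bands are made-up assumptions), two risks whose expected losses differ by roughly a factor of seven end up in the same matrix cell:

```python
# A small demonstration of range compression: two risks with very
# different expected losses receive identical matrix ratings.
# The probability and impact bands are illustrative assumptions.

def band(value, boundaries):
    """Map a continuous value onto a 1-5 ordinal band."""
    for score, upper in enumerate(boundaries, start=1):
        if value <= upper:
            return score
    return len(boundaries) + 1

prob_bands = [0.05, 0.15, 0.35, 0.65]                 # probability boundaries
impact_bands = [10_000, 50_000, 250_000, 1_000_000]   # loss in dollars

risk_a = (0.20, 60_000)    # expected loss: 12,000
risk_b = (0.34, 240_000)   # expected loss: 81,600

for name, (p, loss) in [("A", risk_a), ("B", risk_b)]:
    cell = (band(p, prob_bands), band(loss, impact_bands))
    print(f"Risk {name}: matrix cell {cell}, expected loss {p * loss:,.0f}")
# Both risks land in cell (3, 3) even though B's expected loss is ~7x A's.
```

Both risks get the same rating despite a roughly 7:1 difference in expected loss – Cox’s range compression in action.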
Many project management standards advocate the use of scoring techniques. To be fair, in many situations they are adequate as long as they are used with an understanding of their limitations. Seen in this light, Hubbard’s book is an admonition to standards and textbook writers to be more critical of the methods they advocate, and a warning to practitioners that an uncritical adherence to standards and best practices is not the best way to manage project risks.
Scoring done right
Just to be clear, Hubbard’s criticism is directed at scoring methods that use arbitrary, qualitative scales which are not justified by independent analysis. There are other techniques which, though superficially similar to these flawed scoring methods, are actually quite robust because they:
- are based on observations;
- use real measures (as opposed to arbitrary ones, such as “alignment with business objectives” rated on a scale of 1 to 5 without defining what “alignment” means);
- are validated after the fact (and hence refined with use).
As an example of a sound scoring technique, Hubbard quotes this paper by Dawes, which presents evidence that linear scoring models are superior to intuition in clinical judgements. Strikingly, the scoring model outperforms clinical intuition even though the weights themselves can be obtained through intuition. This happens because human intuition is good at identifying important factors, but poor at evaluating the net effect of several, possibly competing, factors – which is why even simple linear scoring models can outperform it. The key point is that these models are validated by checking their predictions against reality.
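As a rough illustration of the idea – not Dawes’ actual procedure – the sketch below standardises two predictors, combines them with equal (unit) weights, and then performs the crucial validation step of checking the scores against observed outcomes. The data are invented purely for the example:

```python
# A sketch of a Dawes-style "improper" linear model: standardise each
# predictor, apply equal (unit) weights, and validate the resulting
# scores against observed outcomes. The data below are made up purely
# for illustration.
from statistics import mean, stdev

# Each row: two predictor ratings plus the outcome actually observed later.
cases = [
    (3, 7, 0), (8, 6, 1), (5, 9, 1), (2, 4, 0),
    (9, 8, 1), (4, 3, 0), (7, 5, 1), (1, 6, 0),
]

def zscores(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

z1 = zscores([c[0] for c in cases])
z2 = zscores([c[1] for c in cases])
scores = [a + b for a, b in zip(z1, z2)]  # unit weights: just add them up

# Validation step: does a higher score actually go with a worse outcome?
# Here we check the hit rate of a simple threshold rule.
hits = sum((s > 0) == bool(c[2]) for s, c in zip(scores, cases))
print(f"{hits}/{len(cases)} cases classified correctly")
```

The weights here are crude (all equal), yet the validation step is what separates this from the arbitrary scales criticised earlier: if the model’s predictions don’t track reality, we find out.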
Another class of techniques uses axioms based on logic to reduce inconsistencies in decisions. An example of such a technique is multi-attribute utility theory. Because they are grounded in logic, these methods can also be considered to have a solid foundation, unlike those discussed in the previous section.
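For flavour, here is a minimal additive multi-attribute utility sketch. The utility functions, weights and options are all illustrative assumptions; in a real MAUT exercise they would be elicited from the decision maker under the theory’s axioms:

```python
# A minimal additive multi-attribute utility sketch, assuming the form
# u(x) = sum_i w_i * u_i(x_i). The single-attribute utility functions,
# weights and options below are illustrative assumptions only.

def u_cost(cost):       # utility falls (nonlinearly) as cost rises, 0..1
    return max(0.0, 1.0 - (cost / 1_000_000) ** 0.5)

def u_delay(weeks):     # utility falls as delay rises, 0..1
    return max(0.0, 1.0 - weeks / 52)

weights = {"cost": 0.7, "delay": 0.3}   # must sum to 1 for the additive form

def utility(option):
    return (weights["cost"] * u_cost(option["cost"])
            + weights["delay"] * u_delay(option["delay"]))

options = {
    "mitigate now": {"cost": 250_000, "delay": 8},
    "accept risk":  {"cost": 640_000, "delay": 2},
}
best = max(options, key=lambda name: utility(options[name]))
print({name: round(utility(o), 3) for name, o in options.items()}, "->", best)
```

Note the contrast with the arbitrary scales criticised earlier: each attribute is mapped onto an explicitly defined utility function rather than an undefined 1-to-5 rating, and the additive form is only valid when the attributes are preferentially independent – an assumption the theory makes explicit rather than leaving implicit.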
Many commonly used scoring methods in risk analysis are based on shaky theoretical foundations – or worse, none at all. To compound the problem, they are often used without any validation. A particularly ubiquitous example is the well-known and much-loved risk matrix. In his paper on risk matrices, Tony Cox shows how risk matrices can sometimes lead to decisions that are worse than those made on the basis of a coin toss. That this is even a possibility – however small – should worry anyone who uses risk matrices (or other flawed scoring techniques) without an understanding of their limitations.