Archive for October 2009
One of the questions that organisations grapple with is whether or not to outsource software development work to external providers. The work of Oliver Williamson – one of the 2009 Nobel Laureates for Economics – provides some insight into this issue. This post is a brief look at how Williamson’s work on transaction cost economics can be applied to the question of outsourcing.
A firm has two choices for any economic activity: performing the activity in-house or going to market. In either case, the cost of the activity can be decomposed into production costs, which are direct and indirect costs of producing the good or service, and transaction costs, which are other (indirect) costs incurred in performing the economic activity.
In the case of in-house application development, production costs include developer time, software tools and so on, whereas transaction costs include the costs of building an internal team (with the right skills, attitude and knowledge) and of managing uncertainty. On the other hand, in outsourced application development, production costs include all costs that the vendor incurs in producing the application, whereas transaction costs (typically incurred by the client) include the following:
- Search costs: cost of searching for providers of the product / service.
- Selection costs: cost of selecting a specific vendor.
- Bargaining costs: costs incurred in agreeing on an acceptable price.
- Enforcement costs: costs of measuring compliance, costs of enforcing the contract etc.
- Coordination costs: costs of coordinating work done by the vendor, including the cost of managing the vendor relationship.
From the above list, it is clear that transaction costs for outsourcing can be hard to estimate.
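The build-versus-buy comparison implied above can be reduced to simple arithmetic: total up production and transaction costs for each option and compare. The sketch below does exactly that; all figures are invented for illustration, and the cost categories follow the lists in this post:

```python
# Toy build-vs-buy comparison based on transaction cost economics.
# All figures are hypothetical; real transaction costs (search, selection,
# bargaining, enforcement, coordination) are notoriously hard to estimate.

def total_cost(production, transaction):
    """Total cost of an economic activity = production costs + transaction costs."""
    return production + sum(transaction.values())

in_house = total_cost(
    production=500_000,  # developer time, tools, infrastructure
    transaction={"team_building": 80_000, "uncertainty": 40_000},
)

outsourced = total_cost(
    production=400_000,  # vendor's price for building the application
    transaction={
        "search": 10_000,
        "selection": 15_000,
        "bargaining": 20_000,
        "enforcement": 60_000,
        "coordination": 120_000,
    },
)

print(f"in-house:   {in_house}")
print(f"outsourced: {outsourced}")
# Despite the vendor's lower production cost, outsourcing loses here
# once transaction costs are added in.
```

The point of the sketch is not the numbers but the structure: a lower production cost is only part of the story, and the decision can flip once the harder-to-see transaction costs are counted.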
Now, according to Williamson, the decision as to whether or not an economic activity should be outsourced depends critically on transaction costs. To quote from an article in the Economist which describes his work:
…All economic transactions are costly – even in competitive markets, there are costs associated with figuring out the right price. The most efficient institutional arrangement for carrying out a particular economic activity would be the one that minimized transaction costs.
The most efficient institutional arrangement is often the market (i.e. outsourcing, in the context of this post), but firms (i.e. in-house IT arrangements) are sometimes better.
So, when are firms better?
Williamson’s work provides an answer to this question. He argues that the cost of completing an economic transaction in an open market:
- Increases with the complexity of the transaction (implementing an ERP system is more complex than implementing a new email system).
- Increases if it involves assets that are worth more within a relationship between two parties than outside it: for example, custom IT services tailored to the requirements of a specific company have more value to the two parties – provider and client – than to anyone else. This is called asset specificity in economic theory.
These features make it difficult if not impossible to write and enforce contracts that take every eventuality into account. To quote from Williamson (2002):
…. all complex contracts are unavoidably incomplete, on which account the parties will be confronted with the need to adapt to unanticipated disturbances that arise by reason of gaps, errors, and omissions in the original contract….
Why are complex contracts necessarily incomplete?
Well, there are at least a couple of reasons:
- Bounds on human rationality: basically, no one can foresee everything, so contracts inevitably omit important eventualities.
- Strategic behavior: This refers to opportunistic behavior to gain advantage over the other party. This might be manifested as a refusal to cooperate or a request to renegotiate the contract.
Contracts will therefore work only if interpreted in a farsighted manner, with disputes being settled directly between the vendor and client. As Williamson states in this paper:
…important to the transaction-cost economics enterprise is the assumption that contracts, albeit incomplete, are interpreted in a farsighted manner, according to which economic actors look ahead, perceive potential hazards and embed transactions in governance structures that have hazard-mitigating purpose and effect. Also, most of the governance action works through private ordering with courts being reserved for purposes of ultimate appeal.
At some point this becomes too hard to do. In such situations it makes sense to carry out the transaction within a single legal entity (i.e. within a firm) rather than on the open market. This shouldn’t be surprising: it is obvious that complex transactions will be simplified if they take place within a single governance structure.
The above has implications for both clients and providers in outsourcing arrangements. From the client perspective, when contracts for IT services are hard to draw up and enforce, it may be better to have those services provided by in-house departments rather than external vendors. On the other hand, vendors need to focus on keeping contracts as unambiguous and transparent as possible. Finally, both clients and vendors should expect ambiguities and omissions in contracts, and be flexible whenever there are disagreements over the interpretation of contract terms.
The key takeaway is easy to summarise: be sure to consider transaction costs when you are making a decision on whether or not to outsource development work.
It would have been a couple of weeks after the kit tracking system was released that Therese called Mike to report the problem.
“How’re you going, Mike?” she asked, and without waiting to hear his reply, continued, “I’m at a site doing kit allocations and I can’t find the screen that will let me allocate sub-kits.”
“What’s a sub-kit?” Mike was flummoxed; it was the first time he’d heard the term. It hadn’t come up during any of the analysis sessions, tests, or any of the countless conversations he’d had with end-users during development.
“Well, we occasionally have to break open kits and allocate different parts of them to different sites,” said Therese. “When this happens, we need to keep track of which site has which part.”
“Sorry Therese, but this never came up during any of the requirements sessions, so there is no screen.”
“What do I do? I have to record this somehow.” She was upset, and understandably so.
“Look,” said Mike, “could you make a note of the sub-kit allocations on paper – or better yet, in Excel?”
“Yeah, I could do that if I have to.”
“Great. Just be sure to record the kit identifier and which part of the kit is allocated to which site. We’ll have a chat about the sub-kit allocation process when you are back from your site visit. Once I understand the process, I should be able to have it programmed in a couple of days. When will you be back?”
“Tomorrow,” said Therese.
“OK, I’ll book something for tomorrow afternoon.”
The conversation concluded with the usual pleasantries.
After Mike hung up he wondered how they could have missed such an evidently important requirement. The application had been developed in close consultation with users. The requirements sessions had involved more than half the user community. How had they forgotten to mention such an important requirement and, more important, how had he and the other analyst not asked the question, “Are kits ever divided up between sites?”
Mike and Therese had their chat the next day. As it turned out, Mike’s off-the-cuff estimate was off by a long way. It took him over a week to add in the sub-kit functionality, and another day or so to import all the data that users had entered in Excel (and paper!) whilst the screens were being built.
The missing requirement turned out to be a pretty expensive omission.
The story of Therese and Mike may ring true with those who are involved with software development. Gathering requirements is an error-prone process: users forget to mention things, and analysts don’t always ask the right questions. This is one reason why iterative development is superior to BDUF (big design up front) approaches: the former offers many more opportunities for interaction between users and analysts, and hence many more opportunities to catch those elusive requirements.
Yet, although Mike had used a joint development approach, with plenty of interaction between users and developers, this important requirement had been overlooked.
Further, as Mike’s experience corroborates, fixing issues associated with missing requirements can be expensive.
Fact 25 in Robert Glass’s book, Facts and Fallacies of Software Engineering, states: Missing requirements are the hardest requirements errors to correct.
In his discussion of the above, Glass has this to say:
Why are missing requirements so devastating to problem solution? Because each requirement contributes to the level of difficulty of solving a problem, and the interaction among all those requirements quickly escalates the complexity of the problem’s solution. The omission of one requirement may balloon into failing to consider a whole host of problems in designing a solution.
Of course, by definition, missing requirements are hard to test for. Glass continues:
Why are missing requirements hard to detect and correct? Because the most basic portion of the error removal process in software is requirements-driven. We define test cases to verify that each requirement in the problem solution has been satisfied. If a requirement is not present, it will not appear in the specification and, therefore, will not be checked during any of the specification-driven reviews or inspections; further there will be no test cases built to verify its satisfaction. Thus the most basic error removal approaches will fail to detect its absence.
As a corollary to the above fact, Glass states that:
The most persistent software errors – those that escape the testing process and persist into the production version of the software – are errors of omitted logic. Missing requirements result in omitted logic.
In his research, Glass found that 30% of persistent errors were errors of omitted logic! It is pretty clear why these errors persist – because it is difficult to test for something that isn’t there. In the story above, the error would have remained undetected until someone needed to allocate sub-kits – something not done very often. This is probably why Therese and other users forgot to mention it. Why the analysts didn’t ask is another question: it is their job to ask questions that will catch such elusive requirements. And before Mike reads this and cries foul, I should admit that I was the other analyst on the project, and I have absolutely no defence to offer.
A couple of months ago I wrote an article highlighting some of the pitfalls of using risk matrices. Risk matrices are an example of scoring methods – techniques which use ordinal scales to assess risks. In these methods, risks are ranked by some predefined criteria such as impact or expected loss, and the ranking is then used as the basis for decisions on how the risks should be addressed. Scoring methods are popular because they are easy to use. However, as Douglas Hubbard points out in his critique of current risk management practices, many commonly used scoring techniques are flawed. This post – based on Hubbard’s critique and research papers quoted therein – is a brief look at some of the flaws of risk scoring techniques.
Commonly used risk scoring techniques and problems associated with them
Scoring techniques fall under two major categories:
- Weighted scores: These use several ordered scales which are weighted according to perceived importance. For example, one might be asked to rate financial risk, technical risk and organisational risk on a scale of 1 to 5 each, and then weight them by factors of 0.6, 0.3 and 0.1 respectively (possibly because the CFO – who happens to be the project sponsor – is more concerned about financial risk than any other risks). The point is, the scores and weights assigned can be highly subjective – more on that below.
- Risk matrices: These rank risks along two dimensions – probability and impact – and assign them a qualitative ranking of high, medium or low depending on where they fall. Cox’s work shows that such categorisations can be internally inconsistent because the category boundaries are arbitrarily chosen.
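The weighted-score idea in the first bullet takes only a few lines to express. In the sketch below, both the ratings and the weights are invented – which is precisely the subjectivity problem the bullet describes:

```python
# Weighted risk score: rate each risk dimension on an ordinal 1-5 scale,
# then combine the ratings using subjective weights. Both the ratings and
# the weights are judgment calls -- the core weakness Hubbard points out.

weights = {"financial": 0.6, "technical": 0.3, "organisational": 0.1}

def weighted_score(ratings, weights):
    """Combine ordinal ratings into a single score using the given weights."""
    return sum(weights[k] * ratings[k] for k in weights)

# Hypothetical ratings for one project, each on a 1-5 scale.
project = {"financial": 4, "technical": 2, "organisational": 5}

score = weighted_score(project, weights)
print(round(score, 2))  # 0.6*4 + 0.3*2 + 0.1*5 = 3.5
```

Nothing in the calculation tells you whether a "4" on financial risk means anything comparable across two different analysts, or whether the 0.6 weighting is justified – which is exactly Hubbard's complaint.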
Hubbard makes the point that, although both the above methods are endorsed by many standards and methodologies (including those used in project management), they should be used with caution because they are flawed. To quote from his book:
Together these ordinal/scoring methods are the benchmark for the analysis of risks and/or decisions in at least some component of most large organizations. Thousands of people have been certified in methods based in part on computing risk scores like this. The major management consulting firms have influenced virtually all of these standards. Since what these standards all have in common is the use of various scoring schemes instead of actual quantitative risk analysis methods, I will call them collectively the “scoring methods.” And all of them, without exception, are borderline or worthless. In practice, they may make many decisions far worse than they would have been using merely unaided judgements.
What is the basis for this claim? Hubbard points to the following:
- Scoring methods do not make any allowance for flawed perceptions of analysts who assign scores – i.e. they do not consider the effect of cognitive bias. I won’t dwell on this as I have previously written about the effect of cognitive biases in project risk management -see this post and this one, for example.
- Qualitative descriptions assigned to each score are understood differently by different people. Further, there is rarely any objective guidance as to how an analyst is to distinguish between a high and a medium risk. Such advice may not even help: research by Budescu, Broomell and Po shows that there can be huge variances in understanding of qualitative descriptions, even when people are given specific guidelines about what the descriptions or terms mean.
- Scoring methods add their own errors. Below are brief descriptions of some of these:
- In his paper on risk matrices, Cox notes that “Typical risk matrices can correctly and unambiguously compare only a small fraction (e.g., less than 10%) of randomly selected pairs of hazards. They can assign identical ratings to quantitatively very different risks.” He calls this behaviour “range compression” – and it applies to any scoring technique that uses ranges.
- Assigned scores tend to cluster in the middle of the scale. Analysis by Hubbard shows that, on a 5-point scale, 75% of all responses are 3 or 4. This implies that changing a score from 3 to 4, or vice versa, can have a disproportionate effect on the classification of risks.
- Scores implicitly assume that the magnitude of the quantity being assessed is directly proportional to the scale. For example, a score of 2 implies that the criterion being measured is twice as large as it would be for a score of 1. In reality, however, criteria rarely vary linearly as such a scale implies.
- Scoring techniques often presume that the factors being scored are independent of each other – i.e. there are no correlations between factors. This assumption is rarely tested or justified in any way.
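Cox’s “range compression” point is easy to demonstrate: a matrix that buckets probability and impact into coarse bands will assign the same rating to quantitatively very different risks. Here is a minimal sketch, with band boundaries chosen arbitrarily – as, in practice, they typically are:

```python
# Demonstration of "range compression" in a qualitative risk matrix:
# two risks whose expected losses differ by roughly three orders of
# magnitude can receive identical ratings.

def band(value, boundaries):
    """Map a continuous value onto an ordinal band (0 = lowest)."""
    return sum(value > b for b in boundaries)

# Arbitrary boundaries: probability bands at 5% and 20%,
# impact bands at $10k and $100k.
PROB_BOUNDS = (0.05, 0.20)
IMPACT_BOUNDS = (10_000, 100_000)

# Combine the two band indices into a qualitative rating.
RATING = {0: "low", 1: "medium", 2: "medium", 3: "high", 4: "high"}

def rate(probability, impact):
    return RATING[band(probability, PROB_BOUNDS) + band(impact, IMPACT_BOUNDS)]

risk_a = rate(0.06, 110_000)    # expected loss:      $6,600
risk_b = rate(0.95, 9_000_000)  # expected loss: $8,550,000
print(risk_a, risk_b)  # both rated "high" despite a ~1000x difference
```

The specific boundaries don't matter: any coarse banding scheme will produce cells that lump together risks of very different magnitudes.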
Many project management standards advocate the use of scoring techniques. To be fair, in many situations they are adequate as long as they are used with an understanding of their limitations. Seen in this light, Hubbard’s book is an admonition to standards and textbook writers to be more critical of the methods they advocate, and a warning to practitioners that an uncritical adherence to standards and best practices is not the best way to manage project risks .
Scoring done right
Just to be clear, Hubbard’s criticism is directed against scoring methods that use arbitrary, qualitative scales which are not justified by independent analysis. There are other techniques which, though superficially similar to these flawed scoring methods, are actually quite robust because they are:
- Based on observations.
- Based on real measures (as opposed to arbitrary ones, such as “alignment with business objectives” on a scale of 1 to 5, with “alignment” left undefined).
- Validated after the fact (and hence refined with use).
As an example of a sound scoring technique, Hubbard quotes this paper by Dawes, which presents evidence that linear scoring models are superior to intuition in clinical judgements. Strangely, although the weights themselves can be obtained through intuition, the scoring model outperforms clinical intuition. This happens because human intuition is good at identifying important factors, but not so hot at evaluating the net effect of several, possibly competing factors. Hence simple linear scoring models can outperform intuition. The key here is that the models are validated by checking the predictions against reality.
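The Dawes result can be caricatured in a few lines: intuition supplies the weights, the model applies them consistently, and – crucially – its predictions are checked against observed outcomes. All weights and data below are invented purely for illustration:

```python
# Sketch of a simple linear scoring model in the spirit of Dawes:
# intuition supplies the weights, the model applies them consistently,
# and validation against observed outcomes is what makes it defensible.
# All weights and data below are hypothetical.

weights = [0.5, 0.3, 0.2]  # intuition-derived importance of three factors

def linear_score(factors):
    """Consistent linear combination of factor values."""
    return sum(w * f for w, f in zip(weights, factors))

# Hypothetical historical cases: (factor values, observed outcome score).
history = [
    ([4, 2, 5], 3.5),
    ([1, 1, 2], 1.0),
    ([5, 5, 5], 5.0),
]

# The validation step: compare predictions with what actually happened,
# so the model can be refined with use.
mean_abs_error = sum(
    abs(linear_score(factors) - outcome) for factors, outcome in history
) / len(history)

print(round(mean_abs_error, 2))
```

The contrast with the flawed scoring methods above is the last step: the model's output is compared against reality, so a badly chosen weight eventually shows up as prediction error rather than persisting unexamined.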
Another class of techniques uses axioms based on logic to reduce inconsistencies in decisions. An example of such a technique is multi-attribute utility theory. Since they are based on logic, these methods can also be considered to have solid foundations, unlike those discussed in the previous section.
Many commonly used scoring methods in risk analysis are based on flaky theoretical foundations – or worse, none at all. To compound the problem, they are often used without any validation. A particularly ubiquitous example is the well-known and loved risk matrix. In his paper on risk matrices, Tony Cox shows how risk matrices can sometimes lead to decisions that are worse than those made on the basis of a coin toss. The fact that this is a possibility – even if only a small one – should worry anyone who uses risk matrices (or other flawed scoring techniques) without an understanding of their limitations.