When more knowledge means more uncertainty – a task correlation paradox and its resolution
Project tasks can have a variety of dependencies. The most commonly encountered ones are task scheduling dependencies such as finish-to-start and start-to-start relationships, which are available in many scheduling tools. However, other kinds of dependencies are possible too. For example, the durations of two tasks may be correlated in such a way that if one task takes longer or shorter than average, then so does the other. [Note: In statistics such a relationship between two quantities is called a positive correlation, and an inverse relationship is termed a negative correlation.] In the absence of detailed knowledge of the relationship, one can model such duration dependencies through statistical correlation coefficients.
In my previous post, I showed – via Monte Carlo simulations – that the uncertainty in the duration of a project increases if project task durations are positively correlated (the increase in uncertainty being relative to the uncorrelated case). At first sight this is counter-intuitive, even paradoxical. Knowing that tasks are correlated amounts to knowing more about the tasks than in the uncorrelated case. More knowledge should equate to less uncertainty, so one would expect the uncertainty to decrease compared to the uncorrelated case. This post discusses the paradox and its resolution using the example presented in the previous post.
I’ll begin with a brief recapitulation of the main points of the previous post and then discuss the paradox in some detail.
The example and the paradox
The “project” that I simulated consisted of two identical, triangularly distributed tasks performed sequentially. The triangular distribution for each of the tasks had the following parameters: minimum, most likely and maximum durations of 2, 4 and 8 days respectively. Simulations were carried out for two cases:
- No correlation between the two tasks.
- A correlation coefficient of 0.79 between the two tasks.
The simulations yielded probability distributions for overall completion times for the two cases. I then calculated the standard deviation for both distributions. The standard deviation is a measure of the “spread” or uncertainty represented by a distribution. The standard deviation for the correlated case turned out to be more than 30% larger than that for the uncorrelated case (2.33 and 1.77 days respectively), indicating that the probability distribution for the correlated case has a much wider spread than that for the uncorrelated case. The difference in spread can be seen quite clearly in figure 5 of my previous post, which depicts the frequency histograms for the two simulations (the frequency histograms are essentially proportional to the probability distribution). Note that the averages for the two cases are 9.34 and 9.32 days – statistically identical, as we might expect, because correlation leaves the average duration of each task unchanged and the average of a sum is simply the sum of the averages.
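For readers who want to experiment, here is a minimal sketch of such a simulation using only the Python standard library. It is not the method used in the previous post: the correlation is induced with a Gaussian copula, which is an assumption on my part, so a parameter of 0.79 will not reproduce the figures above exactly. The function names (triangular_ppf, simulate_totals), the sample size and the seed are all illustrative.

```python
import random
from statistics import NormalDist, mean, pstdev

LOW, MODE, HIGH = 2.0, 4.0, 8.0  # minimum, most likely, maximum (days)


def triangular_ppf(u, low=LOW, mode=MODE, high=HIGH):
    """Inverse CDF of the triangular distribution: maps u in [0, 1] to a duration."""
    cut = (mode - low) / (high - low)  # CDF value at the mode
    if u < cut:
        return low + (u * (high - low) * (mode - low)) ** 0.5
    return high - ((1.0 - u) * (high - low) * (high - mode)) ** 0.5


def simulate_totals(rho, n=50_000, seed=1):
    """Simulate total durations of two sequential triangular tasks.

    rho is the correlation of the underlying normal variables (Gaussian
    copula, an assumed mechanism); the correlation between the durations
    themselves will be close to, but not exactly, rho.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    totals = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        a = triangular_ppf(std_normal.cdf(z1))  # duration of task A
        b = triangular_ppf(std_normal.cdf(z2))  # duration of task B
        totals.append(a + b)
    return totals


uncorrelated = simulate_totals(rho=0.0)
correlated = simulate_totals(rho=0.79)

for label, totals in (("uncorrelated", uncorrelated), ("correlated", correlated)):
    print(f"{label:>12}: mean = {mean(totals):.2f}, std dev = {pstdev(totals):.2f}")
```

The two means should come out nearly equal, while the standard deviation for the correlated run should be noticeably larger, which is the behaviour described above.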
Why is the uncertainty (as measured by the standard deviation of the distribution) greater in the correlated case?
Here’s a brief explanation. In the uncorrelated case, the outcome of the first task has no bearing on the outcome of the second. So if the first task takes longer than the average time (or more precisely, the median time), the second one still has an even chance of finishing before its median time. There is, therefore, a good chance in the uncorrelated case that overruns (underruns) in the first task will be cancelled out by underruns (overruns) in the second. This is essentially why the combined distribution for the uncorrelated case is more symmetric than that of the correlated case (see figure 5 of the previous post). In the correlated case, however, if the first task takes longer than the median time, chances are that the second task will take longer than the median too (with a similar argument holding for shorter-than-median times). The second task thus tends to amplify the outcome of the first. This effect becomes more pronounced as we move towards the extremes of the distribution, making extreme outcomes more likely than in the uncorrelated case. This broadens the combined probability distribution, and hence the larger standard deviation.
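The same conclusion can be checked with the standard formula for the variance of a sum of two correlated quantities. The working below is a sketch that assumes the 0.79 is the ordinary (Pearson) correlation coefficient between the two durations. The symbols are my own notation: a, c and b are the minimum, most likely and maximum durations (2, 4 and 8 days), σ denotes a standard deviation and ρ the correlation coefficient.

```latex
\begin{align*}
\sigma_{A+B}^2 &= \sigma_A^2 + \sigma_B^2 + 2\rho\,\sigma_A\sigma_B \\
\sigma_A^2 = \sigma_B^2 &= \frac{a^2 + b^2 + c^2 - ab - ac - bc}{18}
  = \frac{4 + 64 + 16 - 16 - 8 - 32}{18} = \frac{28}{18} \approx 1.56 \\
\rho = 0:\quad \sigma_{A+B} &= \sqrt{2 \times \tfrac{28}{18}} \approx 1.76 \\
\rho = 0.79:\quad \sigma_{A+B} &= \sqrt{2 \times \tfrac{28}{18} \times (1 + 0.79)} \approx 2.36
\end{align*}
```

These values are close to the simulated 1.77 and 2.33 days. The small gap in the correlated case could be due to sampling error or to the way a simulation tool applies the correlation coefficient (for example, to ranks rather than to the durations themselves), but that is a guess on my part.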
Now, although the above explanation is technically correct, the sense that something’s not quite right remains: how can it be that knowing more about the tasks that make up a project results in increased overall uncertainty?
Resolving the paradox
The key to resolving the paradox lies in looking at the situation after task A has completed but B is yet to start. Let’s look at this in some detail.
Consider the uncorrelated case first. The two tasks are independent, so after A completes, we still know nothing more about the possible duration of B other than that it is triangularly distributed with minimum, most likely and maximum times of 2, 4 and 8 days. In the correlated case, however, the duration of B tracks the duration of A – that is, if A takes a long (or short) time then B is likely to as well. So, after A has completed, we have a pretty good idea of how long B will take. Our knowledge of the correlation works to reduce the uncertainty in B – but only after A is done.
One can also frame the argument in terms of conditional probability.
In the uncorrelated case, the probability distribution of B – let’s call it P(B) – is independent of A. So the conditional probability distribution of B given the duration of the completed task A (often denoted as P(B|A)) is identical to P(B). That is, there is no change in our knowledge of B after A has completed. Remember that we know P(B) – it is a triangular distribution with minimum, most likely and maximum completion times of 2, 4 and 8 days respectively. In the correlated case, however, P(B|A) is not the same as P(B) – knowing how long A took has a huge bearing on the distribution of B. Even if one does not know the conditional distribution of B, one can say with some certainty that outcomes close to the duration of A are very likely, and outcomes substantially different from A are highly unlikely. The degree of “unlikeliness” – and the consequent shape of the distribution – depends on the value of the correlation coefficient.
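The conditional argument can be illustrated numerically. The sketch below compares the spread of B over all simulated projects with the spread of B in only those projects where A ran long (between 6 and 6.5 days, an arbitrary window chosen for illustration). As before, the correlation is induced with a Gaussian copula, which is my assumption rather than the method of the previous post, and all function and variable names are hypothetical.

```python
import random
from statistics import NormalDist, mean, pstdev


def triangular_ppf(u, low=2.0, mode=4.0, high=8.0):
    """Inverse CDF of the triangular distribution (durations in days)."""
    cut = (mode - low) / (high - low)
    if u < cut:
        return low + (u * (high - low) * (mode - low)) ** 0.5
    return high - ((1.0 - u) * (high - low) * (high - mode)) ** 0.5


def sample_pairs(rho, n=100_000, seed=7):
    """Return n simulated (A, B) duration pairs; rho is the copula parameter."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        pairs.append((triangular_ppf(std_normal.cdf(z1)),
                      triangular_ppf(std_normal.cdf(z2))))
    return pairs


results = {}
for rho in (0.0, 0.79):
    pairs = sample_pairs(rho)
    all_b = [b for _, b in pairs]                           # sample from P(B)
    b_given_long_a = [b for a, b in pairs if 6.0 <= a <= 6.5]  # sample from P(B|A)
    results[rho] = {
        "std_b": pstdev(all_b),
        "std_b_given_a": pstdev(b_given_long_a),
        "mean_b_given_a": mean(b_given_long_a),
    }
    print(f"rho = {rho}: std of B = {results[rho]['std_b']:.2f}, "
          f"std of B given A in [6, 6.5] = {results[rho]['std_b_given_a']:.2f}")
```

In the uncorrelated run the two spreads should be essentially the same, because learning the duration of A tells us nothing about B. In the correlated run the conditional spread should be smaller, and the conditional average of B should shift towards longer durations.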
So we see that, on the one hand, positive correlations between tasks increase uncertainty in the overall duration of the two tasks. This happens because extreme outcomes (both tasks running long, or both running short) become more likely when the tasks are correlated. On the other hand, knowledge of the correlation can also reduce uncertainty – but only after one of the correlated tasks is done. There is no paradox here; it’s all a question of where we are on the project timeline.
Of course, one can argue that the paradox is an artefact of the assumption that the two tasks remain triangularly distributed in the correlated case. It is far from obvious that this assumption is correct, and it is hard to validate in the real world. That said, I should add that most commercially available simulation tools treat correlations in much the same way as I have done in my previous post – see this article from the @Risk knowledge base, for example.
In the end, though, even if the paradox is only an artefact of modelling and has no real-world application, it is still a good pedagogic example of how probability distributions can combine to give counter-intuitive results.
Thanks to Vlado Bokan for several interesting conversations relating to this paradox.