## When more knowledge means more uncertainty – a task correlation paradox and its resolution

### Introduction

Project tasks can have a variety of dependencies. The most commonly encountered ones are task scheduling dependencies such as finish-to-start and start-to-start relationships, which are available in many scheduling tools. However, other kinds of dependencies are possible too. For example, it can happen that the *durations* of two tasks are correlated in such a way that if one task takes longer or shorter than average, then so does the other. [Note: In statistics such a relationship between two quantities is called a *positive* correlation, and an inverse relationship is termed a *negative* correlation.] In the absence of detailed knowledge of the relationship, one can model such duration dependencies through statistical correlation coefficients.

In my previous post, I showed – via Monte Carlo simulations – that the *uncertainty in the duration of a project increases if project task durations are positively correlated* (the increase in uncertainty being relative to the uncorrelated case). At first sight this is counter-intuitive, even paradoxical. Knowing that tasks are correlated amounts to knowing more about the tasks than in the uncorrelated case. More knowledge should equate to less uncertainty, so one would expect the uncertainty to decrease compared to the uncorrelated case. This post discusses the paradox and its resolution using the example presented in the previous post.

I’ll begin with a brief recapitulation of the main points of the previous post and then discuss the paradox in some detail.

### The example and the paradox

The “project” that I simulated consisted of two identical, triangularly distributed tasks performed sequentially. The triangular distribution for each of the tasks had the following parameters: minimum, most likely and maximum durations of 2, 4 and 8 days respectively. Simulations were carried out for two cases:

- No correlation between the two tasks.
- A correlation coefficient of 0.79 between the two tasks.

The simulations yielded probability distributions for overall completion times for the two cases. I then calculated the standard deviation for both distributions. The standard deviation is a measure of the “spread” or uncertainty represented by a distribution. The standard deviation for the correlated case turned out to be more than 30% larger than that for the uncorrelated case (2.33 and 1.77 days respectively), indicating that the probability distribution for the correlated case has a much wider spread than that for the uncorrelated case. The difference in spread can be seen quite clearly in figure 5 of my previous post, which depicts the frequency histograms for the two simulations (the frequency histograms are essentially proportional to the probability distribution). Note that the averages for the two cases are 9.34 and 9.32 days – statistically identical, as we might expect, because the tasks are identically distributed.
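For readers who want to reproduce the comparison, here is a minimal sketch using only the Python standard library. It is not the setup used in the previous post: it assumes the correlation is imposed with a Gaussian copula (correlated normal variables mapped onto the triangular distribution), so `rho` is the correlation of the underlying normals and the numbers will differ slightly from those quoted above. The function names and the trial count are illustrative choices.

```python
import math
import random
from statistics import NormalDist, mean, stdev

LOW, MODE, HIGH = 2.0, 4.0, 8.0  # task parameters from the post, in days


def triangular_inverse_cdf(u, low=LOW, mode=MODE, high=HIGH):
    """Map a uniform(0, 1) value to a triangularly distributed duration."""
    cut = (mode - low) / (high - low)  # cumulative probability at the mode
    if u < cut:
        return low + math.sqrt(u * (high - low) * (mode - low))
    return high - math.sqrt((1.0 - u) * (high - low) * (high - mode))


def simulate_totals(rho, n_trials=20000, seed=1):
    """Return simulated total durations (task A + task B).

    Assumption: dependence is imposed with a Gaussian copula, where rho is
    the correlation between the underlying standard normal variables.
    """
    rng = random.Random(seed)
    phi = NormalDist().cdf
    totals = []
    for _ in range(n_trials):
        z_a = rng.gauss(0.0, 1.0)
        # z_b is standard normal with correlation rho to z_a
        z_b = rho * z_a + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        a = triangular_inverse_cdf(phi(z_a))
        b = triangular_inverse_cdf(phi(z_b))
        totals.append(a + b)
    return totals


for rho in (0.0, 0.79):
    totals = simulate_totals(rho)
    print(f"rho={rho:.2f}  mean={mean(totals):.2f}  sd={stdev(totals):.2f}")
```

Both runs should give nearly the same mean, with a noticeably larger standard deviation for the correlated run.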

Why is the uncertainty (as measured by the standard deviation of the distribution) greater in the correlated case?

Here’s a brief explanation why. In the uncorrelated case, the outcome of the first task has no bearing on the outcome of the second. So if the first task takes longer than its median time, the second one still has an even chance of finishing before its own median time. There is, therefore, a good chance in the uncorrelated case that overruns (underruns) in the first task will be cancelled out by underruns (overruns) in the second. This is essentially why the combined distribution for the uncorrelated case is more symmetric than that of the correlated case (see figure 5 of the previous post). In the correlated case, however, if the first task takes longer than the median time, chances are that the second task will take longer than the median too (with a similar argument holding for shorter-than-median times). The second task thus amplifies the outcome of the first. This effect becomes more pronounced as we move towards the extremes of the distribution, making extreme outcomes more likely than in the uncorrelated case. The combined probability distribution is therefore broader – hence the larger standard deviation.
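The broadening can also be checked with the standard formula for the variance of a sum. This is a back-of-the-envelope sketch rather than part of the original simulation: it assumes that 0.79 is the Pearson correlation $\rho$ between the two durations, and it uses the textbook variance of a triangular distribution with minimum $a = 2$, mode $c = 4$ and maximum $b = 8$.

$$
\sigma^2 = \frac{a^2 + b^2 + c^2 - ab - ac - bc}{18} = \frac{4 + 64 + 16 - 16 - 8 - 32}{18} = \frac{28}{18} \approx 1.56 \ \text{days}^2
$$

$$
\mathrm{Var}(A + B) = \sigma_A^2 + \sigma_B^2 + 2\rho\,\sigma_A \sigma_B = 2\sigma^2 (1 + \rho)
$$

$$
\rho = 0:\ \sqrt{2\sigma^2} \approx 1.76 \ \text{days}, \qquad \rho = 0.79:\ \sqrt{2\sigma^2 \times 1.79} \approx 2.36 \ \text{days}
$$

These are close to the simulated values of 1.77 and 2.33 days. Small differences are to be expected from sampling error and from the way a simulation tool imposes the correlation.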

Now, although the above explanation is technically correct, the sense that something’s not quite right remains: *how can it be that knowing more about the tasks that make up a project results in increased overall uncertainty*?

### Resolving the paradox

The key to resolving the paradox lies in looking at the situation after task A has completed but B is yet to start. Let’s look at this in some detail.

Consider the uncorrelated case first. The two tasks are independent, so after A completes, we still know nothing more about the possible duration of B other than that it is triangularly distributed with min, most likely and max times of 2, 4 and 8 days. In the correlated case, however, the duration of B tracks the duration of A – that is, if A takes a long (or short) time then so will B. So, after A has completed, we have a pretty good idea of how long B will take. Our knowledge of the correlation works to reduce the uncertainty in B – *but only after A is done*.

One can also frame the argument in terms of conditional probability.

In the uncorrelated case, the probability distribution of B – let’s call it P(B) – is independent of A. So the conditional probability of B given that A has already finished (often denoted as P(B|A)) is identical to P(B). That is, there is no change in our knowledge of B after A has completed. Remember that we know P(B) – it is a triangular distribution with min, most likely and max completion times of 2, 4 and 8 days respectively. In the correlated case, however, P(B|A) is not the same as P(B) – knowing how long A took has a huge bearing on the distribution of B. Even if one does not know the conditional distribution of B, one *can* say with some certainty that outcomes close to the duration of A are very likely, and outcomes substantially different from A are highly unlikely. The degree of “unlikeliness” – and the consequent shape of the distribution – depends on the value of the correlation coefficient.
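The narrowing of P(B|A) can be illustrated with a short simulation. The sketch below again assumes a Gaussian copula for the dependence (an assumption of this example, not necessarily how any particular tool does it), which makes it possible to sample B directly once the duration of A is known. The function names and the choice of A = 7 days are illustrative.

```python
import math
import random
from statistics import NormalDist, mean, stdev

LOW, MODE, HIGH = 2.0, 4.0, 8.0  # task parameters from the post, in days


def triangular_cdf(x, low=LOW, mode=MODE, high=HIGH):
    """Cumulative probability of a triangular duration being <= x."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    if x <= mode:
        return (x - low) ** 2 / ((high - low) * (mode - low))
    return 1.0 - (high - x) ** 2 / ((high - low) * (high - mode))


def triangular_inverse_cdf(u, low=LOW, mode=MODE, high=HIGH):
    """Map a uniform(0, 1) value to a triangularly distributed duration."""
    cut = (mode - low) / (high - low)
    if u < cut:
        return low + math.sqrt(u * (high - low) * (mode - low))
    return high - math.sqrt((1.0 - u) * (high - low) * (high - mode))


def sample_b_given_a(a_days, rho, n_trials=20000, seed=1):
    """Sample durations of B given that A took a_days (strictly between 2 and 8).

    Assumption: Gaussian copula, so the normal score of B given the normal
    score z_a of A is normal with mean rho * z_a and variance 1 - rho**2.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    z_a = std_normal.inv_cdf(triangular_cdf(a_days))
    spread = math.sqrt(1.0 - rho * rho)
    samples = []
    for _ in range(n_trials):
        z_b = rho * z_a + spread * rng.gauss(0.0, 1.0)
        samples.append(triangular_inverse_cdf(std_normal.cdf(z_b)))
    return samples


for rho in (0.0, 0.79):
    b = sample_b_given_a(7.0, rho)
    print(f"rho={rho:.2f}  mean(B|A=7)={mean(b):.2f}  sd(B|A=7)={stdev(b):.2f}")
```

With `rho = 0.0` the conditional samples are just the original triangular distribution. With `rho = 0.79` and a long A, the samples of B cluster at the long end and their standard deviation is visibly smaller.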

### Endnote

So we see that, on the one hand, *positive correlations between tasks increase uncertainty in the overall duration of the two tasks*. This happens because extreme outcomes (both tasks running long, or both running short) are more likely when the tasks are correlated. On the other hand, *knowledge of the correlation can also reduce uncertainty – but only after one of the correlated tasks is done*. There is no paradox here; it’s all a question of where we are on the project timeline.

Of course, one can argue that the paradox is an artefact of the assumption that the two tasks remain triangularly distributed in the correlated case. It is far from obvious that this assumption is correct, and it is hard to validate in the real world. That said, I should add that most commercially available simulation tools treat correlations in much the same way as I have done in my previous post – see this article from the @Risk knowledge base, for example.

In the end, though, even if the paradox is only an artefact of modelling and has no real world application, it is still a good pedagogic example of how probability distributions can combine to give counter-intuitive results.

**Acknowledgement**:

Thanks to Vlado Bokan for several interesting conversations relating to this paradox.

K,

The dependency issue is why the US Defense Contract Management Agency (DCMA) prohibits (or at least strongly discourages) the use of anything other than FS.

As well, no leads or lags beyond 5 working days are allowed. For a large program, the 5 days is the same as 0 days.

Then they “strongly suggest” that all constraints are ASAP, with only one MSO – the Authorization to Proceed (ATP) or Contract Award (CA).

With this paradigm, you get a clean Integrated Master Schedule (IMS), where the critical path is actually the “real” one.

The next concept is to use the Integrated Master Plan architecture. You can find the source materials on Google. The IMP and the supporting IMS form a vertically integrated network of activities. The path from the Tasks (or Work Packages) to the Accomplishment Criteria, to the Significant Accomplishments, landing on the Program Event isolates the interdependencies and unfavorable correlations between the work streams.

As you so rightly suggest the Triangle distribution is the starting point. In Risk+ the correlation between activities is a report from the simulation run. In @Risk For Project there are other correlation processes.

This is called crucitiality (sic) and is a measure of the coupling stability of the activity network.

Glen B Alleman, December 17, 2009 at 11:08 am

Glen,

Thanks for your detailed comments – it is especially interesting to know about scheduling practices that DCMA suggests and the rationale behind them.

As for correlation, in my (limited) experience, many models that incorporate correlations between task durations (or any other project variables) use simplistic assumptions. For example, using a Pearson (product-moment) correlation coefficient (as I have done above and in my previous post) boils down to assuming a linear relationship. The real dependence may well be nonlinear. If this is so, simulations made using a single correlation coefficient can be very misleading. The problem is that it is hard to model correlations correctly, particularly given that reliable historical data is hard to come by. Even where historical data is available, inferring the dependence isn’t easy. That is why most folks (and tools) model correlations using a linear approach.

Regardless of the real-world utility of the technique, however, I think the paradox is a nice illustration of some of the surprises that lurk within simulations. If nothing else, it highlights the importance of understanding the mechanisms behind the models that tools make available to us.

Regards,

Kailash.

K, December 17, 2009 at 5:27 pm

I believe there’s an alternative resolution to this paradox:

The scenario compares *known* high (positive) correlation between the duration of two tasks, to *known* zero- or low-correlation between task durations.

That is, this is not, in fact, a case of “more knowledge” in one case leading to less certainty, but equivalent knowledge about different situations.

Contrast this with a situation where you have *unknown* correlation between the duration of two tasks: they could be perfectly correlated, perfectly inverse, or anywhere in between, yet the probability distribution is assumed to match the case where the two are independent – and, from this perspective, it is far from clear that this is an accurate assumption.

Stephen Voris, May 9, 2014 at 4:28 am