Eight to Late

Sensemaking and Analytics for Organizations


A gentle introduction to Monte Carlo simulation for project managers


This article covers the why, what and how of Monte Carlo simulation using a canonical example from project management – estimating the duration of a small project. Before starting, however, I’d like to say a few words about the tool I’m going to use.

Despite the bad rap spreadsheets get from tech types – and I have to admit that many of their complaints are justified – the fact is, Excel remains one of the most ubiquitous “computational” tools in the corporate world. Most business professionals would have used it at one time or another. So, if you’re a project manager and want the rationale behind your estimates to be accessible to the widest possible audience, you are probably better off presenting them in Excel than in SPSS, SAS, Python, R or pretty much anything else. Consequently, the tool I’ll use in this article is Microsoft Excel. For those who know about Monte Carlo and want to cut to the chase, here’s the Excel workbook containing all the calculations detailed in later sections. However, if you’re unfamiliar with the technique, you may want to have a read of the article before playing with the spreadsheet.

In keeping with the format of the tutorials on this blog, I’ve assumed very little prior knowledge about probability, let alone Monte Carlo simulation. Consequently, the article is verbose and the tone somewhat didactic.

Introduction

Estimation is a key part of a project manager’s role. The most frequent (and consequential) estimates they are asked to deliver relate to time and cost. Often these are calculated and presented as point estimates: i.e. single numbers – as in, this task will take 3 days. Or, a little better, as two-point ranges – as in, this task will take between 2 and 5 days. Better still, many use a PERT-like approach wherein estimates are based on 3 points: best, most likely and worst case scenarios – as in, this task will take between 2 and 5 days, but it’s most likely that we’ll finish on day 3. We’ll use three-point estimates as a starting point for Monte Carlo simulation, but first, some relevant background.

It is a truism, well borne out by experience, that it is easier to estimate small, simple tasks than large, complex ones. Indeed, this is why one of the early to-dos in a project is the construction of a work breakdown structure. However, a problem arises when one combines the estimates for individual elements into an overall estimate for a project or a phase thereof: a straightforward addition of individual estimates or bounds will almost always lead to a grossly incorrect estimate of overall time or cost. The reason for this is simple: estimates are necessarily based on probabilities and probabilities do not combine additively. Monte Carlo simulation provides a principled and intuitive way to obtain probabilistic estimates at the level of an entire project based on estimates of the individual tasks that comprise it.

The problem

The best way to explain Monte Carlo is through a simple worked example. So, let’s consider the 4-task project shown in Figure 1. In this project, the second task is dependent on the first, and the third and fourth are dependent on the second but not on each other. The upshot of this is that the first two tasks have to be performed sequentially and the last two can be done at the same time, but can only be started after the second task is completed.

To summarise: the first two tasks must be done in series and the last two can be done in parallel.

Figure 1: A project with 4 tasks.

Figure 1 also shows the three-point estimates for each task – that is, the minimum, most likely and maximum completion times. For completeness I’ve listed them below:

  • Task 1 – Min: 2 days; Most Likely: 4 days; Max: 8 days
  • Task 2 – Min: 3 days; Most Likely: 5 days; Max: 10 days
  • Task 3 – Min: 3 days; Most Likely: 6 days; Max: 9 days
  • Task 4 – Min: 2 days; Most Likely: 4 days; Max: 7 days

OK, so that’s the situation as it is given to us. The first step to developing an estimate is to formulate the problem in a way that it can be tackled using Monte Carlo simulation. This brings us to the important topic of the shape of uncertainty aka probability distributions.

The shape of uncertainty

Consider the data for Task 1. You have been told that it most often finishes on day 4. However, if things go well, it could take as little as 2 days; but if things go badly it could take as long as 8 days. Therefore, your range of possible finish times (outcomes) is from 2 to 8 days.

Clearly, these outcomes are not all equally likely. The most likely outcome is that you will finish the task in 4 days (from what your team member has told you). Moreover, the likelihood of finishing in less than 2 days or taking more than 8 days is zero. If we plot the likelihood of completion against completion time, it would look something like Figure 2.

Figure 2: Likelihood of finishing on day 2, day 4 and day 8.

Figure 2 raises a couple of questions:

  1. What are the relative likelihoods of completion for all intermediate times – i.e. those between 2 and 4 days and between 4 and 8 days?
  2. How can one quantify the likelihood of intermediate times? In other words, how can one get a numerical value of the likelihood for all times between 2 and 8 days? Note that we know from the earlier discussion that this must be zero for any time less than 2 or greater than 8 days.

The two questions are actually related. As we shall soon see, once we know the relative likelihood of completion at all times (compared to the maximum), we can work out its numerical value.

Since we don’t know anything about intermediate times (I’m assuming there is no other historical data available), the simplest thing to do is to assume that the likelihood increases linearly (as a straight line) from 2 to 4 days and decreases in the same way from 4 to 8 days as shown in Figure 3. This gives us the well-known triangular distribution.

Jargon Buster: The term distribution is simply a fancy word for a plot of likelihood vs. time.

Figure 3: Triangular distribution fitted to points in Figure 1

Of course, this isn’t the only possibility; there are an infinite number of others. Figure 4 is another (admittedly weird) example.

Figure 4: Another distribution that fits the points in Figure 2.

Further, it is quite possible that the upper limit (8 days) is not a hard one. It may be that in exceptional cases the task could take much longer (for example, if your team member calls in sick for two weeks) or even not be completed at all (for example, if she then leaves for that mythical greener pasture).  Catering for the latter possibility, the shape of the likelihood might resemble Figure 5.

Figure 5: A distribution that allows for a very long (potentially) infinite completion time

The main takeaway from the above is that uncertainties should be expressed as shapes rather than numbers, a notion popularised by Sam Savage in his book, The Flaw of Averages.

[Aside: you may have noticed that all the distributions shown above are skewed to the right – that is, they have a long tail. This is a general feature of distributions that describe the time (or cost) of project tasks. It would take me too far afield to discuss why this is so, but if you’re interested you may want to check out my post on the inherent uncertainty of project task estimates.]

From likelihood to probability

Thus far, I have used the word “likelihood” without bothering to define it.  It’s time to make the notion more precise.  I’ll begin by asking the question: what common sense properties do we expect a quantitative measure of likelihood to have?

Consider the following:

  1. If an event is impossible, its likelihood should be zero.
  2. The sum of likelihoods of all possible events should equal complete certainty. That is, it should be a constant. As this constant can be anything, let us define it to be 1.

In terms of the example above, if we denote time by t and the likelihood by P(t)  then:

P(t) = 0 for t < 2 and t > 8

And

\sum_{t}P(t) = 1 where 2\leq t\leq 8

Where \sum_{t} denotes the sum of all non-zero likelihoods – i.e. those that lie between 2 and 8 days. In simple terms this is the area enclosed by the likelihood curves and the x axis in figures 2 to 5.  (Technical Note:  Since t is a continuous variable, this should be denoted by an integral rather than a simple sum, but this is a technicality that need not concern us here)

P(t) is, in fact, what mathematicians call probability – which explains why I have used the symbol P rather than L. Now that I’ve explained what it is, I’ll use the word “probability” instead of “likelihood” in the remainder of this article.

With these assumptions in hand, we can now obtain numerical values for the probability of completion for all times between 2 and 8 days. This can be figured out by noting that the area under the probability curve (the triangle in figure 3 and the weird shape in figure 4) must equal 1, and we’ll do this next.  Indeed, for the problem at hand, we’ll assume that all four task durations can be fitted to triangular distributions. This is primarily to keep things  simple. However, I should emphasise that you can use any shape so long as you can express it mathematically, and I’ll say more about this towards the end of this article.

The triangular distribution

Let’s look at the estimate for Task 1. We have three numbers corresponding to a minimum, most likely and maximum time. To keep the discussion general, we’ll call these t_{min}, t_{ml} and t_{max} respectively (we’ll get back to our estimator’s specific numbers later).

Now, what about the probabilities associated with each of these times?

Since t_{min} and t_{max} correspond to the minimum and maximum times, the probability associated with these is zero. Why? Because if it wasn’t zero, there would be a non-zero probability of completion for a time less than t_{min} or greater than t_{max} – which isn’t possible. [Note: this is a consequence of the assumption that the probability varies continuously – so if it takes on a non-zero value, p_{0}, at t_{min} then it must take on a value slightly less than p_{0} – but greater than 0 – at a time slightly smaller than t_{min}.] As far as the most likely time, t_{ml}, is concerned: by definition, the probability attains its highest value at time t_{ml}. So, assuming the probability can be described by a triangular function, the distribution must have the form shown in Figure 6 below.

Figure 6: Triangular distribution redux.

For the simulation, we need to know the equation describing the above distribution.  Although Wikipedia will tell us the answer in a mouse-click, it is instructive to figure it out for ourselves. First, note that the area under the triangle must be equal to  1 because the task must finish at some time between t_{min} and t_{max}.   As a consequence we have:

\frac{1}{2}\times{base}\times{altitude}=\frac{1}{2}\times{(t_{max}-t_{min})}\times{p(t_{ml})}=1\ldots\ldots{(1)}

where p(t_{ml}) is the probability corresponding to time t_{ml}.  With a bit of rearranging we get,

p(t_{ml})=\frac{2}{(t_{max}-t_{min})}\ldots\ldots(2)

To derive the probability for any time t lying between t_{min} and t_{ml}, we note that:

\frac{(t-t_{min})}{p(t)}=\frac{(t_{ml}-t_{min})}{p(t_{ml})}\ldots\ldots(3)

This is a consequence of the fact that the points (t_{min},0), (t, p(t)) and (t_{ml}, p(t_{ml})) all lie on a straight line, so the ratios on either side of equation (3) are both equal to the reciprocal of its slope.

Figure 7

Substituting (2) in (3) and simplifying a bit, we obtain:

p(t)=\frac{2(t-t_{min})}{(t_{ml}-t_{min})(t_{max}-t_{min})}\dots\ldots(4) for t_{min}\leq t \leq t_{ml}

In a similar fashion one can show that the probability for times lying between t_{ml} and t_{max} is given by:

p(t)=\frac{2(t_{max}-t)}{(t_{max}-t_{ml})(t_{max}-t_{min})}\dots\ldots(5) for t_{ml}\leq t \leq t_{max}

Equations 4 and 5 together describe the probability distribution function (or PDF)  for all times between t_{min} and t_{max}.
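For readers who find code easier to follow than formulas, here is a minimal sketch of equations (4) and (5) in Python (the function and argument names are mine; the workbook itself implements the same logic in Excel formulas):

```python
def triangular_pdf(t, t_min, t_ml, t_max):
    """Probability density of a triangular distribution (equations 4 and 5)."""
    if t < t_min or t > t_max:
        return 0.0
    if t <= t_ml:
        return 2 * (t - t_min) / ((t_ml - t_min) * (t_max - t_min))
    return 2 * (t_max - t) / ((t_max - t_ml) * (t_max - t_min))
```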

As it turns out, in Monte Carlo simulations, we don’t directly work with the probability distribution function. Instead we work with the cumulative distribution function (or CDF) which is the probability, P, that the task is completed by time t. To reiterate, the PDF, p(t), is the probability of the task finishing at time t whereas the CDF, P(t), is the probability of the task completing by time t. The CDF, P(t), is essentially a sum of all probabilities between t_{min} and t. For t_{min}\leq t \leq t_{ml}, this is the area of the triangle with vertices at (t_{min}, 0), (t, 0) and (t, p(t)). Using the formula for the area of a triangle (1/2 base times height) and equation (4) we get:

P(t)=\frac{(t-t_{min})^2}{(t_{ml}-t_{min})(t_{max}-t_{min})}\ldots\ldots(6) for t_{min}\leq t \leq t_{ml}

Noting that for t \geq t_{ml}, the area under the curve equals the total area minus the area enclosed by the triangle with base between t and t_{max}, we have:

P(t)=1- \frac{(t_{max}-t)^2}{(t_{max}-t_{ml})(t_{max}-t_{min})}\ldots\ldots(7) for t_{ml}\leq t \leq t_{max}

As expected,  P(t)  starts out with a value 0 at t_{min} and then increases monotonically, attaining a value of 1 at t_{max}.
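Expressed in the same way as the PDF sketch above, equations (6) and (7) become (again a Python sketch, not the workbook’s formulas):

```python
def triangular_cdf(t, t_min, t_ml, t_max):
    """Cumulative probability that the task is completed by time t (equations 6 and 7)."""
    if t <= t_min:
        return 0.0
    if t >= t_max:
        return 1.0
    if t <= t_ml:
        return (t - t_min) ** 2 / ((t_ml - t_min) * (t_max - t_min))
    return 1 - (t_max - t) ** 2 / ((t_max - t_ml) * (t_max - t_min))
```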

To end this section let’s plug in the numbers quoted by our estimator earlier: t_{min}=2, t_{ml}=4 and t_{max}=8. The resulting PDF and CDF are shown in figures 8 and 9.

Figure 8: PDF for triangular distribution (tmin=2, tml=4, tmax=8)

Figure 9 – CDF for triangular distribution (tmin=2, tml=4, tmax=8)

Monte Carlo in a minute

Now with all that conceptual work done, we can get to the main topic of this post:  Monte Carlo estimation. The basic idea behind Monte Carlo is to simulate the entire project (all 4 tasks in this case) a large number N (say 10,000) times and thus obtain N overall completion times.  In each of the N trials, we simulate each of the tasks in the project and add them up appropriately to give us an overall project completion time for the trial.  The resulting N overall completion times will all be different, ranging from the sum of the minimum completion times to the sum of the maximum completion times.  In other words, we will obtain the PDF and CDF for the overall completion time, which will enable us to answer questions such as:

  • How likely is it that the project will be completed within 17 days?
  • What’s the estimated time for which I can be 90% certain that the project will be completed? For brevity, I’ll call this the 90% completion time in the rest of this piece.

“OK, that sounds great”, you say, “but how exactly do we simulate a single task”?

Good question, and I was just about to get to that…

Simulating a single task using the CDF

As we saw earlier, the CDF for the triangular distribution has an S shape and ranges from 0 to 1 in value. It turns out that the S shape is characteristic of all CDFs, regardless of the details of the underlying PDF. Why? Because the cumulative probability must lie between 0 and 1 (remember, probabilities can never exceed 1, nor can they be negative).

OK, so to simulate a task, we:

  • generate a random number between 0 and 1; this corresponds to the cumulative probability that the task will finish by time t.
  • find the time, t, that corresponds to this value of the probability. This is the completion time for the task for this trial.

Incidentally, this method is called inverse transform sampling.

An example might help clarify how inverse transform sampling works. Assume that the random number generated is 0.4905. From the CDF for the first task, we see that this value of probability corresponds to a completion time of 4.503 days, which is the completion time for this trial (see Figure 10). Simple!

Figure 10: Illustrating inverse transform sampling

In this case we found the time directly from the computed CDF. That’s not too convenient when you’re simulating the project 10,000 times. Instead, we need a programmable math expression that gives us the time corresponding to the probability directly. This can be obtained by solving equations (6) and (7) for t. Some straightforward algebra yields the following two expressions for t:

t = t_{min} + \sqrt{P(t)(t_{ml} -  t_{min})(t_{max} - t_{min})} \ldots\ldots(8) for t_{min}\leq t \leq t_{ml}

And

t = t_{max} - \sqrt{[1-P(t)](t_{max} -  t_{ml})(t_{max} - t_{min})} \ldots\ldots(9) for t_{ml}\leq t \leq t_{max}

These can be easily combined in a single Excel formula using an IF function, and I’ll show you exactly how in a minute. Yes, we can now finally get down to the Excel simulation proper and you may want to download the workbook if you haven’t done so already.
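If you would like to see the same logic outside Excel, here is a Python sketch of equations (8) and (9); the branch on the cumulative probability at t_{ml} plays the role of the IF() formula in the workbook (all names are mine):

```python
import random

def triangular_inverse(p, t_min, t_ml, t_max):
    """Return the time t whose cumulative probability is p (equations 8 and 9)."""
    p_ml = (t_ml - t_min) / (t_max - t_min)  # cumulative probability at the most likely time
    if p < p_ml:
        return t_min + (p * (t_ml - t_min) * (t_max - t_min)) ** 0.5
    return t_max - ((1 - p) * (t_max - t_ml) * (t_max - t_min)) ** 0.5

# one simulated completion time for Task 1 (min 2, most likely 4, max 8 days)
print(triangular_inverse(random.random(), 2, 4, 8))
```

As a check, feeding in the probability 0.4905 used in the earlier example returns 4.503 days, as it should.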

The simulation

Open up the workbook and focus on the first three columns of the first sheet to begin with. These simulate the first task in Figure 1, which also happens to be the task we have used to illustrate the construction of the triangular distribution as well as the mechanics of Monte Carlo.

Rows 2 to 4 in columns A and B list the min, most likely and max completion times while the same rows in column C list the probabilities associated with each of the times. For t_{min} the probability is 0 and for t_{max} it is 1. The probability at t_{ml} can be calculated using equation (6) which, for t=t_{ml}, reduces to

P(t_{ml}) =\frac{(t_{ml}-t_{min})}{t_{max}-t_{min}}\ldots\ldots(10)

Rows 6 through 10005 in column A are simulated probabilities of completion for Task 1. These are obtained via the Excel RAND() function, which generates uniformly distributed random numbers lying between 0 and 1.  This gives us a list of probabilities corresponding to 10,000 independent simulations of Task 1.

The 10,000 probabilities need to be translated into completion times for the task. This is done using equation (8) or (9), depending on whether the simulated probability is less than or greater than P(t_{ml}), which is in cell C3 (and given by Equation (10) above). The conditional statement can be coded in an Excel formula using the IF() function.

Tasks 2-4 are coded in exactly the same way, with distribution parameters in rows 2 through 4 and simulation details in rows 6 through 10005 in the columns listed below:

  • Task 2 – probabilities in column D; times in column F
  • Task 3 – probabilities in column H; times in column I
  • Task 4 – probabilities in column K; times in column L

That’s basically it for the simulation of individual tasks. Now let’s see how to combine them.

For tasks in series (Tasks 1 and 2), we simply sum the completion times for each task to get the overall completion times for the two tasks.  This is what’s shown in rows 6 through 10005 of column G.

For tasks in parallel (Tasks 3 and 4), the overall completion time is the maximum of the completion times for the two tasks. This is computed and stored in rows 6 through 10005 of column N.

Finally, the overall project completion time for each simulation is then simply the sum of columns G and N (shown in column O).
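To make the workbook’s logic concrete, here is a rough Python equivalent of the simulation, sketched under the same assumptions (triangular distributions, Tasks 1 and 2 in series, Tasks 3 and 4 in parallel) and reusing the triangular_inverse function sketched earlier:

```python
import random

# three-point estimates (min, most likely, max) in days, from Figure 1
tasks = {
    "Task 1": (2, 4, 8),
    "Task 2": (3, 5, 10),
    "Task 3": (3, 6, 9),
    "Task 4": (2, 4, 7),
}

def simulate_project(n_trials=10_000):
    """Return n_trials simulated overall project completion times."""
    totals = []
    for _ in range(n_trials):
        t = {name: triangular_inverse(random.random(), *params)
             for name, params in tasks.items()}
        in_series = t["Task 1"] + t["Task 2"]        # Tasks 1 and 2 run one after the other
        in_parallel = max(t["Task 3"], t["Task 4"])  # Tasks 3 and 4 run at the same time
        totals.append(in_series + in_parallel)
    return totals

completion_times = simulate_project()
```

Binning these 10,000 totals gives the PDF discussed in the next section, and accumulating the bins gives the CDF.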

Sheets 2 and 3 are plots of the probability and cumulative probability distributions for overall project completion times. I’ll cover these in the next section.

Discussion – probabilities and estimates

The figure on Sheet 2 of the Excel workbook (reproduced in Figure 11 below) is the probability distribution function (PDF) of completion times. The x-axis shows the elapsed time in days and the y-axis the number of Monte Carlo trials with completion times that lie in the relevant time bin (of width 0.5 days). As an example, for the simulation shown in Figure 11, there were 882 trials (out of 10,000) with completion times between 16.25 and 16.75 days. Your numbers will vary, of course, but you should see a maximum in the 16 to 17 day range and a trial count that is reasonably close to the one I got.

Figure 11: Probability distribution of completion times (N=10,000)

I’ll say a bit more about Figure 11 in the next section. For now, let’s move on to Sheet 3 of the workbook, which shows the cumulative probability of completion by a particular day (Figure 12 below). The figure shows the cumulative distribution function (CDF), obtained by summing the probabilities of completion from the earliest possible completion day up to the day in question.

Figure 12: Probability of completion by a particular day (N=10,000)

To reiterate a point made earlier,  the reason we work with the CDF  rather than the PDF is that we are interested in knowing the probability of completion by a particular date (e.g. it is 90% likely that we will finish by April 20th) rather than the probability of completion on a particular date (e.g. there’s a 10% chance we’ll finish on April 17th). We can now answer the two questions we posed earlier. As a reminder, they are:

  • How likely is it that the project will be completed within 17 days?
  • What’s the 90% likely completion time?

Both questions are easily answered by using the cumulative distribution chart on Sheet 3 (or Fig 12).  Reading the relevant numbers from the chart, I see that:

  • There’s a 60% chance that the project will be completed within 17 days.
  • The 90% likely completion time is 19.5 days.

How does the latter compare to the sum of the 90% likely completion times for the individual tasks? The 90% likely completion time for a given task can be calculated by solving Equation 9 for t, with appropriate values for the parameters t_{min}, t_{max} and t_{ml} plugged in, and P(t) set to 0.9. This gives the following values for the 90% likely completion times:

  • Task 1 – 6.5 days
  • Task 2 – 8.1 days
  • Task 3 – 7.7 days
  • Task 4 – 5.8 days

Summing up the first three tasks (remember, Tasks 3 and 4 are in parallel) we get a total of 22.3 days, which is clearly an overestimate. Now, with the benefit of having gone through the simulation, it is easy to see that the sum of 90% likely completion times for individual tasks does not equal the 90% likely completion time for the sum of the relevant individual tasks – the first three tasks in this particular case. Why? Essentially because a Monte Carlo run in which the first three tasks all take as long as their (individual) 90% likely completion times is highly unlikely. Exercise: use the worksheet to estimate how likely this is.
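If you want to reproduce the per-task 90% figures above without the worksheet, you can evaluate the inverse CDF at a probability of 0.9 (again assuming the Python sketches given earlier):

```python
for name, (t_min, t_ml, t_max) in tasks.items():
    t90 = triangular_inverse(0.9, t_min, t_ml, t_max)
    print(f"{name}: 90% likely completion time = {t90:.1f} days")
```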

There’s much more that can be learnt from the CDF. For example, it also tells us that the greatest uncertainty in the estimate is in the 5 day period from ~14 to 19 days because that’s the region in which the probability changes most rapidly as a function of elapsed time. Of course, the exact numbers are dependent on the assumed form of the distribution. I’ll say more about this in the final section.

To close this section, I’d like to reprise a point I mentioned earlier: that uncertainty is a shape, not a number. Monte Carlo simulations make the uncertainty in estimates explicit and can help you frame your estimates in the language of probability…and using a tool like Excel can help you explain these to non-technical people like your manager.

Closing remarks

We’ve covered a fair bit of ground: starting from general observations about how long a task takes, we saw how to construct simple probability distributions and then combine these using Monte Carlo simulation. Before I close, there are a few general points I should mention, for completeness…and as a warning.

First up, it should be clear that the estimates one obtains from a simulation depend critically on the form and parameters of the distribution used. The parameters are essentially an empirical matter; they should be determined using historical data. The form of the function is another matter altogether: as pointed out in an earlier section, one cannot determine the shape of a function from a finite number of data points. Instead, one has to focus on the properties that are important. For example, is there a small but finite chance that a task can take an unreasonably long time? If so, you may want to use a lognormal distribution…but remember, you will need to find a sensible way to estimate the distribution parameters from your historical data.

Second, you may have noted from the probability distribution curve (Figure 11)  that despite the skewed distributions of the individual tasks, the distribution of the overall completion time is somewhat symmetric with a minimum of ~9 days, most likely time of ~16 days and maximum of 24 days.  It turns out that this is a general property of distributions that are generated by adding a large number of independent probabilistic variables. As the number of variables increases, the overall distribution will tend to the ubiquitous Normal distribution.

The assumption of independence merits a closer look. In the case at hand, it implies that the completion times for each task are independent of each other. As most project managers will know from experience, this is rarely the case: in real life, a task that is delayed will usually have knock-on effects on subsequent tasks. One can easily incorporate such dependencies in a Monte Carlo simulation. A formal way to do this is to introduce a non-zero correlation coefficient between tasks as I have done here. A simpler and more realistic approach is to introduce conditional inter-task dependencies. As an example, one could have an inter-task delay that kicks in only if the predecessor task takes more than 80% of its maximum time.

Thirdly, you may have wondered why I used 10,000 trials: why not 100, 1,000 or 20,000? This has to do with the tricky issue of convergence. In a nutshell, the estimates we obtain should not depend on the number of trials used. Why? Because if they did, they’d be meaningless.

Operationally, convergence means that any predicted quantity based on aggregates should not vary with the number of trials. So, if our Monte Carlo simulation has converged, our prediction of 19.5 days for the 90% likely completion time should not change substantially if I increase the number of trials from ten thousand to twenty thousand. I did this and obtained almost the same value of 19.5 days. The average and median completion times (shown in cells Q3 and Q4 of Sheet 1) also remained much the same (16.8 days). If you wish to repeat the calculation, be sure to change the formulas on all three sheets appropriately. I was lazy and hardcoded the number of trials. Sorry!
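A quick way to check convergence, if you are working from the Python sketches rather than the workbook, is to recompute the aggregate quantities for increasing trial counts and confirm that they settle down:

```python
import statistics

for n in (1_000, 10_000, 20_000):
    times = sorted(simulate_project(n))
    t90 = times[int(0.9 * n) - 1]  # crude estimate of the 90% likely completion time
    print(n, round(t90, 1), round(statistics.mean(times), 1))
```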

Finally, I should mention that simulations can be usefully performed at a higher level than individual tasks. In their highly-readable book,  Waltzing With Bears: Managing Risk on Software Projects, Tom De Marco and Timothy Lister show how Monte Carlo methods can be used for variables such as  velocity, time, cost etc.  at the project level as opposed to the task level. I believe it is better to perform simulations at the lowest possible level, the main reason being that it is easier, and less error-prone, to estimate individual tasks than entire projects. Nevertheless, high level simulations can be very useful if one has reliable data to base these on.

There are a few more things I could say about the usefulness of the generated distribution functions and Monte Carlo in general, but they are best relegated to a future article. This one is much too long already and I think I’ve tested your patience enough. Thanks so much for reading, I really do appreciate it and hope that you found it useful.

Acknowledgement: My thanks to Peter Holberton for pointing out a few typographical and coding errors in an earlier version of this article. These have now been fixed. I’d be grateful if readers could bring any errors they find to my attention.

Written by K

March 27, 2018 at 4:11 pm

The drunkard’s dartboard revisited: yet another Excel-based example of Monte Carlo simulation


(Note: An Excel sheet showing sample calculations and plots discussed in this post can be downloaded here.)

Introduction

Some months ago, I wrote a post explaining the basics of Monte Carlo simulation using the example of a drunkard throwing darts at a board. In that post I assumed that the darts could land anywhere on the dartboard with equal probability. In other words, the hit locations were assumed to be uniformly distributed. In a comment on the piece, George Gkotsis challenged this assumption, arguing that regardless of the level of inebriation of the thrower, a dart would be more likely to land near the centre of the board than away from it (providing the player is at least moderately skilled). He also suggested using the Normal distribution to model the spread of hits, with the variance of the distribution serving as a rough measure of the inaccuracy (or drunkenness!) of the drunkard. In George’s words:

I would propose to introduce a ‘skill’ factor, which represents the circle/square ratio (maybe a normal-Gaussian distribution). Of course, this skill factor would be very low (high variance) for a drunken player, but would still take into account the fact that throwing darts into a square is not purely random.

In this post I revisit the drunkard’s dartboard, taking into account George’s suggestions.

Setting the stage

To keep things simple, I’ll make the following assumptions:

Figure 1: The dartboard

  1. The dartboard is a circle of radius 0.5 units centred at the origin (see Figure 1)
  2. The chance of a hit is greatest at the centre of the dartboard and falls off as one moves away from it.
  3. The distribution of hits is a function of distance from the centre but does not depend on direction. In mathematical terms, for a given distance r from the centre of the dartboard, the dart can land at any angle \theta with equal probability, \theta being the angle between the line joining the centre of the board to the dart and the x axis. See Figure 2 for a graphical representation of a hit location in terms of r and \theta. Note that the x and y coordinates can be obtained using the formulas x = r\cos\theta and y= r\sin\theta as shown in Figure 2.
  4. Hits are distributed according to the Normal distribution with maximum at the centre of the dartboard.
  5. The variance of the Normal distribution is a measure of inaccuracy/drunkenness of the drunkard: the more drunk the drunk, the greater the variation in his aim.

Figure 2: The coordinates of a hit location

These assumptions are consistent with George’s suggestions.

The simulation

[Note to the reader: you may want to download the demo before continuing.]

The steps of a simulation run are as follows:

  1. Generate a number that is normally distributed with a zero mean and a specified standard deviation. This gives the distance, r, of a randomly thrown dart from the centre of the board for a player with an “inaccuracy factor” represented by the standard deviation. Column A in the demo contains normally distributed random numbers with zero mean and a standard deviation of 0.2. Note that I selected the latter number for no other reason than that the results show up clearly on the fixed-axis plot shown in Figure 3.
  2. Generate a uniformly distributed random number lying between 0 and 2\pi. This represents the angle \theta. This is the content of column B of the demo.
  3. The numbers obtained from steps 1 and 2 completely specify the location of a hit. The location’s x and y coordinates can be worked out using the formulas x = r\cos\theta and y= r\sin\theta. These are listed in columns C and D in the Excel demo.
  4. Re-run steps 1 through 3 as many times as needed. Note that the demo is set up for 5000 runs. You can change this manually or, better yet, automate it. The latter is left as an exercise for you. (A sketch of these steps in code follows the list.)
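Here is a minimal Python rendering of the steps above (the function name and variable labels are mine; the demo does the equivalent with spreadsheet formulas):

```python
import math
import random

def throw_dart(sigma):
    """One throw: distance from the centre is Normal(0, sigma), angle is uniform."""
    r = random.gauss(0, sigma)               # step 1: normally distributed distance
    theta = random.uniform(0, 2 * math.pi)   # step 2: uniformly distributed angle
    return r * math.cos(theta), r * math.sin(theta)  # step 3: (x, y) coordinates of the hit

hits = [throw_dart(0.2) for _ in range(5000)]  # step 4: repeat, here for the demo's 5000 runs
```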

It is instructive to visualize the resulting hits using a scatter plot. Among other things this can tell you, at a glance, if the results make sense. For example, we would expect hits to be symmetrically distributed about the origin because the drunkard’s throws are not biased in any particular direction around the centre. A non-symmetrical distribution is thus an indication that there is an error in the calculations.

Now, any finite collection of hits is unlikely to be perfectly symmetrical because of outliers. Nevertheless, the distributions should be symmetrical on average. To test this, run the demo a few times (hit F9 with the demo open). Notice how the position of outliers and the overall shape of the distribution of points changes randomly from simulation to simulation. In all cases, however, there is a clear maximum at the centre of the dartboard with the probability of a hit falling with distance from the centre.

Figure 3: Scatter plot for standard deviation=0.2

Figure 3 shows the results of simulations for a standard deviation of 0.2. Figures 4 and 5 show the results of simulations for standard deviations of 0.1 and 0.4.

Figure 4: Scatter plot for standard deviation=0.1

Note that the plot has fixed axes – i.e. the area depicted is the 1×1 square that encloses the dartboard, regardless of the standard deviation. Consequently, for larger standard deviations (such as 0.4) many hits will be out of range and will not show up on the plot.

Figure 5: Scatter plot for standard deviation=0.4

Closing remarks

As I have stressed in my previous posts on Monte Carlo simulation, the usefulness of a simulation depends on the choice of an appropriate distribution. If the selected distribution does not reflect reality, neither will the simulation. This is true regardless of whether one is simulating a drunkard’s wayward aim or the duration of a project task. You may have noted that the assumption of normally-distributed hits has no justification whatsoever; it is just as arbitrary as my original assumption of uniformity. In fact, the hit locations of drunken dart throws are highly unlikely to be either uniform or Normal. Nevertheless, I hope that some of my readers will find the above example to be of pedagogical value.

Acknowledgement

Thanks to George Gkotsis for his comment which got me thinking about this post.

Written by K

November 3, 2011 at 4:59 am

The drunkard’s dartboard: an intuitive explanation of Monte Carlo methods


(Note to the reader: An Excel sheet showing sample calculations and plots discussed in this post  can be downloaded here.)

Monte Carlo simulation techniques have been applied to areas ranging from physics to project management. In earlier posts, I discussed how these methods can be used to simulate project task durations (see this post and this one for example). In those articles, I described simulation procedures in enough detail for readers to be able to reproduce the calculations for themselves. However, as my friend Paul Culmsee mentioned, the mathematics tends to obscure the rationale behind the technique. Indeed, at first sight it seems somewhat paradoxical that one can get accurate answers via random numbers. In this post, I illustrate the basic idea behind Monte Carlo methods through an example that involves nothing more complicated than squares and circles. To begin with, however, I’ll start with something even simpler – a drunken darts player.

Consider a sozzled soul who is throwing darts at a board situated some distance from him.  To keep things simple, we’ll assume the following:

  1. The board is modeled by the circle shown in Figure 1, and our souse scores a point if the dart falls within the circle.
  2. The dart board is inscribed in a square with sides 1 unit long as shown in the figure, and we’ll assume for simplicity that the  dart always  falls somewhere  within the square (our protagonist is not that smashed).
  3. Given his state, our hero’s aim is not quite what it should be –  his darts fall anywhere within the square with equal probability. (Note added on 01 March 2011: See the comment by George Gkotsis below for a critique of this assumption)

Figure 1: "Dartboard" and enclosing square

We can simulate the results of our protagonist’s unsteady attempts by generating two sets of uniformly distributed random numbers lying between 0 and 1 (this is easily done in Excel using the rand() function). The pairs of random numbers thus generated – one from each set – can be treated as the (x,y) coordinates of the dart for a particular throw. The result of 1000 pairs of random numbers thus generated (representing the drunkard’s dart throwing attempts) is shown in Figure 2 (for those interested in seeing the details, an Excel sheet showing the calculations for 1000 trials can be downloaded here).

Figure 2: Result for 1000 throws

A trial results in a “hit” if it lies within the circle.  That is, if it satisfies the following equation:

(x-0.5)^2 + (y-0.5)^2 < 0.25\ldots \ldots (1)

(Note:  if we replace “<”  by “=”  in the above expression, we get the equation for a circle of radius 0.5 units, centered at x=0.5 and y=0.5.)

Now, according to the frequency interpretation of probability, the probability of the plastered player scoring a point is approximated by the ratio of the number of hits in the circle to the total number of attempts. In this case, I get an average of 790/1000, which is 0.79 (generated from 10 sets of 1000 trials each). Your result will be different from mine because you will generate different sets of random numbers from the ones I did. However, it should be reasonably close to my result.
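Should you want to reproduce this outside Excel, the whole experiment fits in a few lines of Python (a sketch of the procedure, not the spreadsheet’s formulas):

```python
import random

def estimate_hit_probability(n_throws=1000):
    """Fraction of uniformly random darts that land inside the circle of radius 0.5."""
    hits = 0
    for _ in range(n_throws):
        x, y = random.random(), random.random()     # dart lands anywhere in the unit square
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 < 0.25:  # inequality (1): inside the dartboard
            hits += 1
    return hits / n_throws

print(estimate_hit_probability(16000))  # should come out close to pi/4, i.e. about 0.785
```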

Further, the frequency interpretation of probability tells us that the approximation becomes more accurate as the number of trials increases. To see why this is so, let’s increase the number of trials and plot the results. I carried out simulations for 2000, 4000, 8000 and 16000 trials. The results of these simulations are shown in Figures 3 through 6.

Figure 3: Result for 2000 throws

Figure 4: Result for 4000 throws


Figure 5: Result for 8000 throws

Figure 6: Result for 16000 throws

Since a dart is equally likely to end up anywhere within the square, the exact probability of a hit is simply the area of the dartboard (i.e. the circle)  divided by the entire area over which the dart can land. In this case, since the area of the enclosure (where the dart must fall) is 1 square unit, the area of the dartboard is actually equal to the probability.  This is easily seen by calculating the area of the circle using the standard formula \pi r^2 where r is the radius of the circle (0.5 units in this case). This yields 0.785398 sq units, which is reasonably close to the number that we got for the 1000 trial case.  In the 16000 trial case, I  get a number that’s closer to the exact result: an average of 0.7860 from 10 sets of 16000 trials.

As we see from Figure 6, in the 16000 trial case, the entire square is peppered with closely-spaced “dart marks” – so much so that it looks as though the square is a uniform blue. Hence, it seems intuitively clear that as we increase the number of throws, we should get a better approximation of the area and, hence, the probability.

There are a couple of points worth mentioning here. First, in principle this technique can be used to calculate areas of figures of any shape. However, the more irregular the figure, the worse the approximation – simply because it becomes less likely that the entire figure will be sampled correctly by “dart throws.” Second,  the reader may have noted that although the 16000 trial case gives a good enough result for the area of the circle, it isn’t particularly accurate considering the large number of trials. Indeed, it is known that the “dart approximation” is not a very good way of calculating areas – see this note for more on this point.

Finally, let’s look at connection between the general approach used in Monte Carlo techniques and the example discussed above  (I use the steps described in the Wikipedia article on Monte Carlo methods as representative of the general approach):

  1. Define a domain of possible inputs – in our case the domain of inputs is defined by the enclosing square of side 1 unit.
  2. Generate inputs randomly from the domain using a certain specified probability distribution – in our example the probability distribution is a pair of independent, uniformly distributed random numbers lying between 0 and 1.
  3. Perform a computation using the inputs – this is the calculation that determines whether or not a particular trial is a hit (i.e. if the x,y coordinates obey inequality (1) it is a hit, else it’s a miss)
  4. Aggregate the results of the individual computations into the final result – This corresponds to the calculation of the probability (or equivalently, the area of the circle) by aggregating the number of hits for each set of trials.

To summarise: Monte Carlo algorithms estimate quantities of interest (such as probabilities) by generating random variables according to pre-specified distributions. In most practical applications one will use more efficient techniques to sample the distribution (rather than the naïve method I’ve used here). However, the basic idea is as simple as playing drunkard’s darts.

Acknowledgements

Thanks go out to Vlado Bokan for helpful conversations while this post was being written and to Paul Culmsee for getting me thinking about a simple way to explain Monte Carlo methods.

Written by K

February 25, 2011 at 4:51 am

Monte Carlo simulation of risk and uncertainty in project tasks


Introduction

When developing duration estimates for a project task, it is useful to make a distinction between the uncertainty inherent in the task and uncertainty due to known risks. The former is uncertainty due to factors that are not known whereas the latter corresponds to uncertainty due to events that are known, but may or may not occur. In this post, I illustrate how the two types of uncertainty can be combined via Monte Carlo simulation. Readers may find it helpful to keep my introduction to Monte Carlo simulations of project tasks handy, as I refer to it extensively in the present piece.

Setting the stage

Let’s assume that there’s a task that needs doing, and the person who is going to do it reckons it will take between 2 and 8 hours to complete it, with a most likely completion time of 4 hours. How the estimator comes up with these numbers isn’t important here – maybe there’s some guesswork, maybe some padding or maybe it is really based on experience (as it should be).  For simplicity we’ll assume the probability distribution for the task duration is triangular. It is not hard to show that, given the above mentioned estimates, the probability, p(t),  that the task will finish at time t is given by the equations below (see my introductory post for a detailed derivation):

p(t)=\frac{(t-2)}{6}\dots\ldots(1) for  2 hours \leq t \leq 4 hours

And,

p(t)=\frac{(8-t)}{12}\dots\ldots(2) for  4 hours \leq t \leq 8 hours

These two expressions are sometimes referred to as the probability distribution function (PDF). The PDF described by equations (1) and (2) is illustrated in Figure 1.

Figure 1: Probability distribution for task

Now, a PDF tells us the probability that the task will finish at a particular time t. However, we are more interested in knowing whether or not the task will be completed by time t – i.e. at or before time t. This quantity, which we’ll denote by P(t) (capital P), is sometimes known as the cumulative distribution function (CDF). The CDF is obtained by summing up the probabilities from t=2 hrs to time t. It is not hard to show that the CDF for the task at hand is given by the following equations:

P(t)=\frac{(t-2)^2}{12}\ldots\ldots(3) for  2 hours \leq t \leq 4 hours

and,

P(t)=1- \frac{(8-t)^2}{24}\ldots\ldots(4) for  4 hours \leq t \leq 8 hours

For a detailed derivation, please see my introductory post. The CDF for the distribution is shown in Figure 2.

Figure 2: CDF for task

Now for the complicating factor: let us assume there is a risk that has a bearing on this task. The risk could be any known factor that has a negative impact on task duration. For example, it could be that a required resource is delayed or that the deliverable fails a quality check and needs rework. The consequence of the risk – should it eventuate – is that the task takes longer. How much longer the task takes depends on the specifics of the risk. For the purpose of this example we’ll assume that the additional time taken is also described by a triangular distribution with a minimum, most likely and maximum time of 1, 2 and 3 hrs respectively. The PDF p_{r}(t) for the additional time taken due to the risk is:

p_{r}(t)=(t-1)\dots\ldots(5) for  1 hour \leq t \leq 2 hours

And

p_{r}(t)=(3-t)\dots\ldots(6) for  2 hrs \leq t \leq 3 hours

The figure for this distribution is shown in Figure 3.

Figure 3: Probability distribution of additional time due to risk

The CDF for the additional time taken if the risk eventuates (which we’ll denote by P_{r}(t)) is given by:

P_{r}(t)=\frac{(t-1)^2}{2}\ldots\ldots(7) for  1 hour \leq t \leq 2 hours

and,

P_{r}(t)=1- \frac{(3-t)^2}{2}\ldots\ldots(8) for  2 hours \leq t \leq 3 hours

The CDF for the risk consequence is shown in Figure 4.

Figure 4: CDF of additional time due to risk

Before proceeding with the simulation it is worth clarifying what all this means, and what we want to do with it.

Firstly, equations 1-4 describe the inherent uncertainty associated with the task while equations 5 through 8 describe the consequences of the risk, if it eventuates.

Secondly, we have described the task and the risk separately. In reality, we need a unified description of the two – a combined distribution function for the uncertainty associated with the task and the risk taken together.  This is what the simulation will give us.

Finally, one thing I have not yet specified is the probability that the risk will actually occur. Clearly, the higher the probability, the greater the potential delay. Below I carry out simulations for risk probabilities varying from 0.1 to 0.5.

That completes the specification of the  problem – let’s move on to the simulation.

The simulation

The simulation procedure I used for the zero-risk case (i.e. the task described by equations 1 and 2) is as follows:

  1. Generate a random number between 0 and 1.  Treat this number as the cumulative probability, P(t) for the simulation run. [You can generate random numbers in Excel using the  rand() function]
  2. Find the time, t,  corresponding to P(t) by solving equations (3) or (4) for t. The resulting value of t is the time taken to complete the task.
  3. Repeat steps (1) and (2)  for a sufficiently large number of trials.

The frequency distribution of completion times for the task, based on 30,000 trials is shown in Figure 5.

Figure 5: Simulation histogram for zero-risk case

As we might expect,  Figure 5 can be translated to the probability distribution shown in Figure 1 by a straightforward normalization – i.e. by dividing each bar by the total number of trials.

What remains to be done is to incorporate the risk (as modeled in equations 5-6) into the simulation. To simulate the task with the risk, we simply do the following for each trial (a code sketch of these steps follows the list):

  1. Simulate the task without the risk as described earlier.
  2. Generate another random number between 0 and 1.
  3. If the random number is less than the probability of the risk, then simulate the  risk. Note that since the risk is described by a triangular function, the procedure to simulate it is the same as that for the task (albeit with different parameters).
  4. If the random number is greater than the probability of the risk, do nothing.
  5. Add the additional time obtained in step 3 (zero, if the risk did not eventuate) to the task time obtained in step 1. This is the outcome of the trial.
  6. Repeat steps 1-5 for as many trials as required.
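In code, the procedure might look like the following Python sketch (it assumes a triangular_inverse function of the kind sketched in the introductory post above, and uses the task’s 2/4/8 hour and the risk’s 1/2/3 hour estimates):

```python
import random

def simulate_task_with_risk(p_risk, n_trials=30_000):
    """Simulated completion times for the task, with a risk that occurs with probability p_risk."""
    results = []
    for _ in range(n_trials):
        t = triangular_inverse(random.random(), 2, 4, 8)       # step 1: the risk-free task
        if random.random() < p_risk:                           # steps 2-4: does the risk eventuate?
            t += triangular_inverse(random.random(), 1, 2, 3)  # step 3: additional time due to the risk
        results.append(t)                                      # step 5: outcome of the trial
    return results
```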

I performed simulations for the task with risk probabilities of 10%, 30% and 50%. The frequency distributions of completion times for these are displayed in Figures 6-8 (in increasing order of probability). As one would expect, the spread of times increases with increasing probability. Further, the distribution takes on a distinct second peak as the probability increases: the first peak is at t=4, corresponding to the most likely completion time of the risk-free task and the second at t=6 corresponding to the most likely additional time of 2 hrs if the risk eventuates.

Figure 6: Simulation histogram (10% probability of risk)

Figure 7: Simulation histogram (30% probability of risk)

Figure 8: Frequency histogram (50% probability of risk)

It is also instructive to compare average completion times for the four cases (zero-risk and 10%, 30% and 50% risk probabilities). The average can be computed from the simulation by simply adding up the simulated completion times (for all trials) and dividing by the total number of simulation trials (30,000 in our case). On doing this, I get the following:

Average completion time for zero-risk case = 4.66 hrs

Average completion time with 10% probability of risk = 4.89 hrs

Average completion time with 30% probability of risk = 5.36 hrs

Average completion time with 50% probability of risk = 5.83 hrs

No surprises here.

One point to note is that the result obtained from the simulation for the zero-risk case compares well with the exact formula for a triangular distribution (see the Wikipedia article for the triangular distribution):

t_{av} = \frac{t_{worst}+t_{best}+t_{most\ likely}}{3}=\frac{8+2+4}{3}=4.67 hrs

This serves as a sanity check on the simulation procedure.
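With the sketch above, the same sanity check is a one-liner (setting the risk probability to zero gives the risk-free task):

```python
import statistics
print(statistics.mean(simulate_task_with_risk(0.0)))  # should be close to (2 + 4 + 8) / 3 ≈ 4.67 hrs
```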

It is also interesting to compare the cumulative probabilities of completion in the zero-risk and high risk (50% probability) case. The CDFs for the two are shown in Figure 9. The co-plotted CDFs allow for a quick comparison of completion time predictions. For example, in the zero-risk case, there is about a  90% chance that the task will be completed in a little over 6 hrs whereas when the probability of the risk is 50%, the 90% completion time increases to 8 hrs (see Figure 9).

Figure 9: CDFs for zero risk and 50% probability of risk cases

Next steps and wrap up

For those who want to learn more about simulating project uncertainty and risk, I can recommend the UK MOD paper – Three Point Estimates And Quantitative Risk Analysis A Process Guide For Risk Practitioners.  The paper provides useful advice on how three point estimates for project variables should be constructed. It also has a good discussion of risk and how it should be combined with the inherent uncertainty associated with a variable. Indeed, the example I have described above  was inspired by the paper’s discussion of uncertainty and risk.

Of course, as with any quantitative predictions of project variables, the numbers are only as reliable as the assumptions that go into them, the main assumption here being the three point estimates that were used to derive the distributions for the task uncertainty and risk (equations 1-2 and 5-6). Typically these are obtained from historical data. However, there are well known problems associated with history-based estimates. For one, we can never be sure that the historical tasks are similar to the one at hand in ways that matter (this is the reference class problem). As Shim Marom warns us in this post, all our predictions depend on the credibility of our estimates. Quoting from his post:

Can you get credible three point estimates? Do you have access to credible historical data to support that? Do you have access to Subject Matter Experts (SMEs) who can assist in getting these credible estimates?

If not, don’t bother using Monte Carlo.

In closing, I hope my readers will find this simple example useful in understanding how uncertainty and risk can be accounted for using Monte Carlo simulations. In the end, though, one should always keep in mind that the use of sophisticated techniques does not magically render one immune to the GIGO principle.

Written by K

February 17, 2011 at 9:55 pm

The reference class problem and its implications for project management


Introduction

Managers make decisions based on incomplete information, so it is no surprise that the tools of probability and statistics have made their way into management practice. This trend has accelerated somewhat over the last few years, particularly with the availability of software tools that simplify much of the grunt-work of using probabilistic techniques such as Monte Carlo methods or Bayesian networks. Purveyors of tools and methodologies assume probabilities (or more correctly, probability distributions) to be known, or exhort users to determine probabilities using relevant historical data. The word relevant is important: it emphasises that the data used to calculate probabilities (or distributions) should be from situations that are similar to the one at hand. This innocuous statement papers over a fundamental problem in the foundations of probability: the reference class problem. This post is a brief introduction to the reference class problem and its implications for project management.

I’ll begin with some background and then, after defining the problem, I’ll present a couple of illustrations of the problem drawn from project management.

Background and the Problem

The most commonly held interpretation of probability is that it is a measure of the frequency with which an event of interest occurs. In this frequentist view, as it is called, probability is defined as the ratio of the number of times the event of interest occurs to the total number of events. An example might help clarify what this means: the probability that a specific project will finish on time is given by the ratio of the number of similar projects that have finished on time to the total number of similar projects undertaken (including both on-time and not-on-time projects).

At first sight the frequentist approach seems a reasonable one. However, in this straightforward definition of probability lurks a problem: how do we determine which events are similar to the one at hand? In terms of the example: what are the criteria by which we can determine the projects that resemble the one we’re interested in? Do we look at projects with similar scope, or do we use size (in terms of budget, resources or other measure), or technology or….? There could be a range of criteria that one could use, but one never knows with certainty which one(s) is (are) the right one(s). Why is it an issue? It’s an issue because probability changes depending on the classification criteria used. This is the reference class problem.

In a paper entitled The Reference Class Problem is Your Problem Too, the philosopher Alan Hajek sums it up as follows:

The reference class problem arises when we want to assign a probability to a proposition (or sentence, or event) X, which may be classified in various ways, yet its probability can change depending on how it is classified.

Incidentally, in another paper entitled Conditional Probability is the Very Guide of Life, Hajek discusses how the reference class problem afflicts all major interpretations of probability, not just the frequentist approach. We’ll stick with the latter interpretation since it is the one used in project management practice and research… and virtually all the social and natural sciences to boot.

The reference class problem in project management

Let’s look at a couple of project management-related illustrations of the reference class problem.
First up, consider the technique of reference class forecasting, which I’ve discussed in this post. Note that the reference class forecasting technique is distinct from the reference class problem although, as we shall see in less than a minute, the technique is fatally afflicted by the problem.

What’s reference class forecasting? To quote from the post referenced earlier, the technique involves:

…creating a probability distribution of estimates based on data for completed projects that are similar to the one of interest, and then comparing the said project with the distribution in order to get a most likely outcome. Basically, [it] consists of the following steps:

  1. Collecting data for a number of similar past projects – these projects form the reference class. The reference class must encompass a sufficient number of projects to produce a meaningful statistical distribution, but individual projects must be similar to the project of interest.
  2. Establishing a probability distribution based on (reliable!) data for the reference class. The challenge here is to get good data for a sufficient number of reference class projects.
  3. Predicting most likely outcomes for the project of interest based on comparisons with the reference class distribution.

Now, the key assumption in reference class forecasting is that it is possible to identify a number of completed projects that are similar to the one at hand. But what does “similar” mean? Clearly the composition of the reference class depends on the similarity criteria used, and consequently so does the resulting distribution. Reference class forecasting is a victim of the reference class problem!

The reference class problem will affect any technique that uses arbitrary criteria to determine the set of all possible events. As another example, the probability distributions used in Monte Carlo simulations (of project cost, duration or whatever) are determined using historical data. Again, typically one selects projects (or tasks, if one is doing a task-level simulation) that are similar to the one at hand. Defining "similar" is left to common sense or expert judgement or some other subjective approach. Yet, by the most commonly used definition, a project is a "temporary endeavor, having a defined beginning and end, undertaken to meet unique goals and objectives". By definition, therefore, we never do the same project twice – at best we do the same project differently (and the same applies to tasks). So, despite one's best intentions and efforts, historical data can never be totally congruent with the situation at hand. There will always be differences, and one cannot tell with certainty that those differences do not matter.

Truth be told, most organizations do not retain data on completed projects – except superficial stuff that isn't much use. The reference class problem seems to justify the position of this slack majority. After all, why bother keeping data when one isn't able to use it to predict project performance? This argument is wrong-headed: although one cannot use it to calculate probabilities, historical data is useful because it keeps us from repeating our errors. Just don't expect the data to yield reliable quantitative information on probabilities.

Before I close this piece, I should clarify that there are areas in which the reference class problem is not an issue. In physics, for example, the discipline of statistical mechanics is founded on the principle that the average motion of large collections of molecules can be treated statistically. Clearly, there is no problem here: molecules are indeed indistinguishable from each other, so a particular molecule (of carbon dioxide gas in a container, say) unambiguously belongs to the reference class of all carbon dioxide molecules in that container. In general this is true of any situation where one is dealing with a large collection of identical (or very similar) entities.

Conclusion

The reference class problem affects most probabilistic methods in project management  and  other areas of the social sciences. It is a problem because it is often impossible to know beforehand  which attributes of the objects or events of interest are the most significant ones. Consequently it is impossible to determine with certainty whether or not a particular object or event belongs to a defined reference class.

I’ll end with an anecdote to illustrate my point:

Some time ago I was asked to provide estimates for design work that was superficially similar to something I’d done before. “You’ve done this before,” a manager said, “so you should be able to estimate this quite accurately.”

As many best practices and methodologies recommend, I used a mix of historical data and “expert” judgement (and added in a dash of padding) to arrive at (what I thought was) a robust estimate. To all you agilists out there, an incremental approach was not an option in this case.

I got it wrong – badly wrong. It turned out that the unique features of the project, which weren’t apparent at first, made a mockery of my estimates. I didn’t know it then, but I’d fallen victim to the reference class problem.

Finally, it should be clear that although my examples are project management focused, the arguments are quite general. They apply to all areas of management theory and practice, and indeed to most areas of inquiry that use probabilistic techniques. To use the words of Alan Hajek:  the reference class problem is your  problem too.

Written by K

May 13, 2010 at 11:42 pm

When more knowledge means more uncertainty – a task correlation paradox and its resolution

with 3 comments

Introduction

Project tasks can have a variety of dependencies. The most commonly encountered ones are  task scheduling dependencies such as finish-to-start and  start-to-start relationships which are available in many scheduling tools.  However, other kinds of dependencies are possible too. For example, it can happen that the durations of two tasks are correlated in such a way that if one task takes longer or shorter than average, then so does the other.  [Note: In statistics such a relationship between two quantities is called a positive correlation and an inverse relationship is termed a negative correlation]. In the absence of detailed knowledge of the relationship,  one can model such duration dependencies through statistical correlation coefficients. In my previous post, I showed – via Monte Carlo simulations – that the uncertainty in the duration of a project increases if project task durations are positively correlated (the increase in uncertainty being relative to the uncorrelated case).  At first sight this is counter-intuitive, even paradoxical.  Knowing that tasks are correlated essentially amounts to more knowledge about the tasks as compared to the uncorrelated case.  More knowledge should equate to less uncertainty, so one would expect the uncertainty to decrease compared to the uncorrelated case. This post discusses the paradox and its resolution using the example presented in the previous post.

I’ll begin with a brief recapitulation of the main points of the previous post and then discuss the paradox in some detail.

The example and the paradox

The “project” that I simulated consisted of two identical, triangularly distributed tasks performed sequentially.  The triangular distribution for each of the tasks had the following parameters:  minimum, most likely and maximum durations of 2, 4 and 8 days respectively.   Simulations were carried out for two cases:

  1. No correlation between the two tasks.
  2. A correlation coefficient of 0.79 between the two tasks.

The simulations yielded probability distributions for overall completion times for the two cases. I then calculated the standard deviation for both distributions. The standard deviation is a measure of the “spread” or uncertainty represented by a distribution. The standard deviation for the correlated case turned out to be more than 30% larger than that for the uncorrelated case (2.33 and 1.77 days respectively), indicating that the probability distribution for the correlated case has a much wider spread than that for the uncorrelated case. The difference in spread can be  seen quite clearly in figure 5 of my previous post, which depicts the frequency histograms for the two simulations (the  frequency histograms are essentially proportional to the probability distribution). Note that the averages for the two cases are 9.34 and 9.32 days –  statistically identical, as we might expect, because the tasks are identically distributed.

Why is the uncertainty (as measured by the standard deviation of the distribution) greater in the correlated case?

Here's a brief explanation why. In the uncorrelated case, the outcome of the first task has no bearing on the outcome of the second. So if the first task takes longer than the average time (or, more precisely, the median time), the second one still has an even chance of finishing in less than the median time. There is, therefore, a good chance in the uncorrelated case that overruns (underruns) in the first task will be cancelled out by underruns (overruns) in the second. This is essentially why the combined distribution for the uncorrelated case is more symmetric than that of the correlated case (see figure 5 of the previous post). In the correlated case, however, if the first task takes longer than the median time, chances are that the second task will take longer than the median too (with a similar argument holding for shorter-than-median times). The second task thus has the effect of amplifying the outcome of the first task. This effect becomes more pronounced as we move towards the extremes of the distribution, thus making extreme outcomes more likely than in the uncorrelated case. This has the effect of broadening the combined probability distribution – hence the larger standard deviation.

Now, although the above  explanation is technically correct, the sense that something’s not quite right remains: how can it be that knowing more about the tasks that make up a project results in increased overall uncertainty?

Resolving the paradox

The key to resolving the paradox  lies in looking at the situation after task A has completed but B is yet to start.  Let’s look at this in some detail.

Consider the uncorrelated case first. The two tasks are independent, so after A completes, we still know nothing more about the possible duration of B other than that it is triangularly distributed with min, most likely and max times of 2, 4 and 8 days. In the correlated case, however, the duration of B tracks the duration of A – that is, if A takes a long (or short) time then so will B. So, after A has completed, we have a pretty good idea of how long B will take. Our knowledge of the correlation works to reduce the uncertainty in B – but only after A is done.

One can also frame the argument in terms of conditional probability.

In the uncorrelated case, the probability distribution of B – let's call it P(B) – is independent of A. So the conditional probability of B given the observed duration of A (often denoted P(B|A)) is identical to P(B). That is, there is no change in our knowledge of B after A has completed. Remember that we know P(B) – it is a triangular distribution with min, most likely and max completion times of 2, 4 and 8 days respectively. In the correlated case, however, P(B|A) is not the same as P(B) – knowing how long A took has a huge bearing on the distribution of B. Even if one does not know the conditional distribution of B exactly, one can say with some confidence that outcomes close to the duration of A are very likely, and outcomes substantially different from A are highly unlikely. The degree of "unlikeliness" – and the consequent shape of the distribution – depends on the value of the correlation coefficient.
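If you'd like to see this effect numerically, here is a rough numpy sketch. Everything in it – the re-ordering trick used to induce the correlation, the seed, and the choice of a window around 6 days – is my own illustration rather than part of the original simulations:

import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility
n, rho = 100_000, 0.79

# Independent triangular samples for A and B (min 2, most likely 4, max 8 days)
a = rng.triangular(2, 4, 8, n)
b = rng.triangular(2, 4, 8, n)

# Induce a positive correlation by re-ordering the samples so that they follow
# the rank order of a correlated bivariate-normal reference sample (the same
# re-ordering idea that underlies the Iman-Conover method used in the previous post)
ref = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
a = np.sort(a)[np.argsort(np.argsort(ref[:, 0]))]
b = np.sort(b)[np.argsort(np.argsort(ref[:, 1]))]

# Spread of B before anything is known about A...
print("std dev of B, A unknown: ", round(b.std(), 2), "days")

# ...versus the spread of B once we know that A took about 6 days.
# The conditional spread should come out markedly smaller.
long_a = (a > 5.75) & (a < 6.25)
print("std dev of B, A ~ 6 days:", round(b[long_a].std(), 2), "days")

The first number reflects the full triangular distribution of B; the second reflects P(B|A), which should be much narrower once A's duration is pinned down.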

Endnote

So we see that, on the one hand, positive correlations between tasks increase uncertainty in the overall duration of the two tasks. This happens because a wider range of outcomes is possible when the tasks are correlated. On the other hand, knowledge of the correlation can also reduce uncertainty – but only after one of the correlated tasks is done. There is no paradox here; it's all a question of where we are on the project timeline.

Of course, one can argue that the paradox is an artefact of the assumption that the two tasks  remain triangularly distributed in the correlated case. It is far from obvious that this assumption is correct, and it is hard to validate in the real world. That said, I should add that most commercially available simulation tools treat correlations in much the same way as I have done in my previous post – see this article from the @Risk knowledge base, for example.

In the end, though,  even if the paradox is only an artefact of modelling and has no real world application, it is  still a good pedagogic example of how probability distributions can combine to give counter-intuitive results.

Acknowledgement:

Thanks to Vlado Bokan for several interesting conversations relating to this paradox.

Written by K

December 17, 2009 at 6:32 am

The effect of task duration correlations on project schedules – a study using Monte Carlo simulation

with 3 comments

Introduction

Some time ago, I wrote a couple of posts on Monte Carlo simulation of project tasks: the first post presented a fairly detailed introduction to the technique and the second illustrated its use via three simple examples. The examples in the latter demonstrated the effect of various dependencies on overall completion times. The dependencies discussed were: two tasks in series (finish-to-start dependency), two tasks in series with a time delay (finish-to-start dependency with a lag) and two tasks in parallel (start-to-start dependency). All of these are dependencies in timing: i.e. they dictate when a successor task can start in relation to its predecessor. However, there are several practical situations in which task durations are correlated – that is, the duration of one task depends on the duration of another. As an example, a project manager working for an enterprise software company might notice that the longer it takes to elicit requirements, the longer it takes to customise the software. When tasks are correlated thus, it is of interest to find out the effect of the correlation on the overall (project) completion time. In this post I explore the effect of correlations on project schedules via Monte Carlo simulation of a simple "project" consisting of two tasks in series.

A bit about what's coming before we dive in. I begin with a brief discussion of how correlations are quantified. I then describe the simulation procedure, following which I present results for the example mentioned earlier, with and without correlations. I then present a detailed comparison of the results for the uncorrelated and correlated cases. It turns out that correlations increase uncertainty. This seemed counter-intuitive to me at first, but the simulations helped me see why it is so.

Note that I’ve attempted to keep the discussion intuitive and (largely)  non-mathematical by  relying on  graphs and tables rather than formulae.   There are a few formulae but most of  these can be skipped quite safely.

Correlated project tasks

Imagine that there are two project tasks, A and B, which need to be performed sequentially.  To keep things simple,  I’ll assume that the durations of  A and B  are described by a triangular distribution with minimum, most likely and maximum completion times of 2, 4 and 8 days respectively (see my introductory Monte Carlo article for a detailed discussion of this distribution – note that I used hours as the unit of time in that post).  In the absence of any other information, it is reasonable to assume that the durations of A and B are independent or uncorrelated – i.e. the time it takes to complete task A does not have any effect on the duration of task B.  This assumption can be tested if we have historical data.  So  let’s assume we have the following historical data gathered from 10 projects:

Duration A (days)   Duration B (days)
2.5                 3
3                   3
7                   7.5
6                   4.5
5.5                 3.5
4.5                 4.5
5                   5.5
4                   4.5
6                   5
3                   3.5

Figure 1 shows a plot of the duration of A vs. the duration of B. The plot suggests that there is a relationship between the two tasks – the longer A takes, the longer B is likely to take.

Figure 1: Duration of A vs. Duration of B

In technical terms we would say that A and B are positively correlated (if one decreased as the other increased, the correlation would be negative).

There are several measures of correlation, the most common one being Pearson’s coefficient of correlation r which is given by

r= \frac{\sum (x_{i} - \bar x)(y_{i}-\bar y)}{\sqrt{\sum (x_{i}-\bar x)^2}\sqrt{\sum (y_{i}-\bar y)^2}}\ldots\ldots (1)

Here x_{i} and y_{i} are the durations of tasks A and B on the ith occasion the project was performed, \bar x is the average duration of A, \bar y the average duration of B and N the total number of data points (10 in this case). The capital sigma (\Sigma) simply denotes a sum from 1 to N.

The Pearson coefficient can vary between -1 and 1: the former being a perfect negative correlation and the latter a perfect positive one [Note: The Pearson coefficient is sometimes referred to as the product-moment correlation coefficient]. On calculating r for the above data, using the CORREL function in Excel, I get a value of 0.787 (one could just as well use the PEARSON function). This is a good indication that there is something going on here – the two tasks are likely not independent, as originally assumed. Note that the correlation coefficient does not tell us anything about the form of the dependence between A and B; it only tells us that they are dependent and whether the dependence is positive or negative. It is also important to note that there is a difference between quantifying the correlation via the Pearson (or any other) coefficient and developing an understanding of why there is a correlation. The coefficient tells us nothing about the latter.
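As a quick cross-check on Excel's CORREL, the same number can be computed with a few lines of Python (the durations are simply typed in from the table above):

import numpy as np

# Historical task durations (days), from the 10-project table above
dur_a = np.array([2.5, 3, 7, 6, 5.5, 4.5, 5, 4, 6, 3])
dur_b = np.array([3, 3, 7.5, 4.5, 3.5, 4.5, 5.5, 4.5, 5, 3.5])

# Pearson correlation coefficient, as in equation (1)
r = np.corrcoef(dur_a, dur_b)[0, 1]
print(round(r, 3))  # should print 0.787, matching CORREL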

If A and B are correlated as discussed above, simulations which assume the tasks to be independent will not be correct.  In the remainder of this article I’ll discuss how correlations affect overall task durations via a Monte Carlo simulation of the aforementioned example.

Simulating correlated project tasks

There are two considerations when simulating correlated tasks. The first is to characterize the correlation accurately. For the purposes of the present discussion I'll assume that the correlation is described adequately by a single coefficient, as discussed in the previous section. The second is to generate correlated completion times that satisfy the individual task duration distributions (remember that the two tasks A and B have completion times described by a triangular distribution with minimum, most likely and maximum times of 2, 4 and 8 days). What we are asking for, in effect, is a way to generate pairs of correlated random numbers, each of which satisfies the triangular distribution.

The best known algorithm to generate correlated sets of random numbers in a way that preserves the individual (input) distributions is due to Iman and Conover. The beauty of the Iman-Conover algorithm is that it takes the uncorrelated data for tasks A and B (simulated separately) as input and induces the desired correlation by simply re-ordering the uncorrelated data. Since the original data is not changed, the distributions for A and B are preserved.  Although the idea behind the method is simple, it is technically quite complex. The details of the technique aren’t important – but I offer a  partial “hand-waving” explanation in the appendix at the end of this post. Fortunately  I didn’t have to implement the Iman-Conover algorithm because someone else has done the hard work: Steve Roxburgh has written a graphical tool  to  generate sets of  correlated random variables using  the technique (follow this link to download the software and this one to view a brief tutorial) .  I used Roxburgh’s utility to generate sets of random variables for my simulations.

I looked at two cases: the first with no correlation between A and B and the second with a correlation of 0.79 between A and B.  Each simulation consisted of 10,000 trials – basically I generated two sets of 10,000 triangularly-distributed random numbers, the first with a correlation coefficient close to zero and the second with a correlation coefficient of 0.79. Figures 2 and 3 depict scatter plots of the durations of A vs. the durations of B (for the same trial) for the uncorrelated and correlated cases.  The correlation is pretty clear to see in Figure 3.

Figure 2: Scatter plot of duration of A vs. duration of B (uncorrelated)

Figure 3: Scatter plot of duration of A vs. duration of B (correlated, r=0.79)

To check that the generated trials for A and B do indeed satisfy the triangular distribution, I divided the difference between the minimum and maximum times (for the individual tasks) into 0.5 day intervals and plotted the number of trials that fall into each interval. The resulting histograms are shown in Figure 4. Note that the blue and red bars are frequency plots for the case where A and B are uncorrelated and the green and pink  (purple?) bars are for the case where they are correlated.

Figure 4: Frequency histograms for A and B (correlated and uncorrelated)

The histograms for all four cases are very similar, demonstrating that they all follow the specified triangular distribution. Figures 2 through 4  give confidence (but do not prove!) that Roxburgh’s utility works as advertised: i.e.  that it generates sets of correlated random numbers in a way that preserves the desired distribution.

Now, to simulate A and B in sequence I simply added the durations of the individual tasks for each trial.  I did this twice – once each for the correlated and uncorrelated data sets –   which yielded two sets of completion times, varying between 4 days (the theoretical minimum) and 16 days (the theoretical maximum).  As before, I plotted a frequency histogram for the uncorrelated and correlated case (see Figure 5).  Note that the difference in the heights of the bars has no significance – it is an artefact of  having the same number of trials (10,000)  in both cases. What is significant is the difference in the spread of the two plots – the correlated case has a greater spread signifying an increased probability of very low and very high completion times compared to the uncorrelated case.

Figure 5: Frequency histograms for overall duration (blue=uncorrelated, red=correlated)

Note that the uncorrelated case resembles a Normal distribution – it is more symmetric than the original triangular distribution. This is a consequence of the Central Limit Theorem, which states that the sum of independent (i.e. uncorrelated), identically distributed random variables tends towards a Normal distribution as the number of variables grows, regardless of the form of the original distribution – with only two tasks we see just the beginnings of this effect. The correlated distribution, on the other hand, has retained the shape of the original triangular distribution. This is no surprise: the relatively high correlation coefficient ensures that A and B behave in a similar fashion and, hence, so does their sum.
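As an aside, the Central Limit Theorem claim is easy to check numerically: sum an increasing number of independent triangular task durations and the skewness of the total should shrink towards zero (i.e. the distribution becomes more symmetric). The sketch below is my own standalone illustration, not part of the two-task simulation:

import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

def skewness(x):
    """Sample skewness: close to zero for a symmetric distribution."""
    d = x - x.mean()
    return (d ** 3).mean() / d.std() ** 3

for k in (1, 2, 5, 20):
    # Overall duration of k independent triangular(2, 4, 8) tasks in series
    totals = rng.triangular(2, 4, 8, size=(100_000, k)).sum(axis=1)
    print(f"{k:>2} task(s): skewness ~ {skewness(totals):.2f}")

With one task the skewness is clearly positive (the triangular distribution has a long right tail); by the time many independent tasks are summed it should be close to zero.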

Figure 6 is a plot of the cumulative distribution function (CDF) for the uncorrelated and correlated cases. The value of the CDF at any time t gives the probability that the overall task will finish within time t.

Figure 6: Cumulative probability distribution for uncorrelated (blue) and correlated (red) cases

The cumulative distribution clearly shows the greater spread in the correlated case: for small values of t, the correlated distribution is significantly greater than the uncorrelated one; whereas for high values of t, the correlated distribution approaches the limiting value of 1 more slowly than the uncorrelated distribution. Both these factors point to a greater spread in the correlated case. The spread can be quantified by looking at the standard deviation of the two distributions.  The standard deviation, often denoted by the small greek letter sigma (\sigma), is given by:

\sigma =\sqrt{\frac{\sum (t_{i}-\bar t)^2}{N}} \ldots\ldots (2)

where N is the total number of trials (10,000), t_{i} is the completion time for the ith trial and \bar t is the average completion time, which is given by

\bar t= \frac{\sum t_{i}}{N} \ldots\ldots (3)

In both (2) and (3) \sum denotes a sum over all trials.

The averages, \bar t, for the uncorrelated and correlated cases are virtually identical: 9.32 days and 9.34 days respectively. On the other hand, the standard deviations for the two cases are 1.77 and 2.34 days respectively – demonstrating the wider spread in possible completion times for the correlated case. And, of course, a wider spread means greater uncertainty.
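For readers who would rather not set up Excel and Roxburgh's utility, here is a rough Python sketch that reproduces these numbers. Note that it induces the correlation with a Gaussian copula rather than the Iman-Conover re-ordering used by Roxburgh's tool – a different route that also preserves the triangular marginals – and the function names, seed and trial count are my own choices. The achieved correlation and the summary statistics should land close to, though not exactly at, the values quoted above.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)  # arbitrary seed
n_trials = 10_000

def triangular_inv_cdf(u, a=2.0, c=4.0, b=8.0):
    """Inverse CDF of a triangular distribution with min a, most likely c, max b."""
    split = (c - a) / (b - a)
    return np.where(u < split,
                    a + np.sqrt(u * (b - a) * (c - a)),
                    b - np.sqrt((1 - u) * (b - a) * (b - c)))

def simulate_tasks(rho):
    """Durations of tasks A and B with (approximately) the given correlation."""
    z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n_trials)
    u = norm.cdf(z)  # correlated uniforms; the marginals stay uniform
    return triangular_inv_cdf(u[:, 0]), triangular_inv_cdf(u[:, 1])

for rho, label in ((0.0, "uncorrelated"), (0.79, "correlated")):
    dur_a, dur_b = simulate_tasks(rho)
    total = dur_a + dur_b
    print(f"{label:>12}: r = {np.corrcoef(dur_a, dur_b)[0, 1]:.2f}, "
          f"mean = {total.mean():.2f} days, std dev = {total.std():.2f} days")

On a typical run the uncorrelated case should show a standard deviation of about 1.8 days and the correlated case about 2.3 days, in line with the results above.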

So,  the simulations tell us that correlations increase uncertainty. Let’s try to understand why this happens. Basically, if tasks are correlated positively, they  “track” each other: that is,  if one takes a long time so will the other (with the same holding for short durations).  The upshot of this is that the overall completion time tends to get “stretched” if the first task takes longer than average whereas it gets “compressed” if the first task finishes earlier than average. Since the net effect of stretching and compressing would balance out, we would expect the mean completion time (or any other measure of central tendency – such as the mode or median) to be relatively unaffected. However, because extremes are amplified, we would expect the  spread of the distribution to increase.

Wrap-up

In this post  I have highlighted the effect of task correlations on project schedules  by comparing the results of simulations for two sequential tasks with and without correlations. The example shows that correlations can  increase uncertainty.  The mechanism is easy to understand: correlations tend to amplify extreme outcomes, thus increasing the spread in the resulting distribution.  The effect of the correlation (compared to the uncorrelated case) can be quantified by comparing the standard deviations of the two cases.

Of course, quantifying correlations using a single number is simplistic –  real life correlations have all kinds of complex dependencies.   Nevertheless,  it is a useful first step because  it helps one develop an intuition for what might happen in more complicated cases: in hindsight it is easy to see that  (positive) correlations will amplify extremes,  but the simple model helped me really see it.

— —

Appendix – more on  the Iman-Conover algorithm

Below I offer a hand-waving half-explanation of how the technique works; those interested in a proper, technical explanation should see this paper by Simon Midenhall.

Before I launch off into my explanation, I’ll need to take a bit of a detour on coefficients of correlation. The title of Iman and Conover’s paper talks about rank correlation which is different from  product-moment (or Pearson) correlation discussed in this post.  A popular measure of rank correlation is the Spearman coefficient, r_{s}, which is given by:

r_{s} = 1-\frac{6\sum d_{i}^2}{N(N^2-1)}

where d_{i} is the rank difference between the duration of A and B on the ith  instance of the project. Note that rank is calculated relative to all the other instances of a particular task (A or B).  This is best explained through the table below, which shows the ranks for all instances of task A and B from my earlier example (columns 3 and 4).

Duration A (days)   Duration B (days)   Rank A   Rank B   Squared rank difference
2.5                 3                   1        1        0
3                   3                   2        1        1
7                   7.5                 10       10       0
6                   4.5                 8        5        9
5.5                 3.5                 7        3        16
4.5                 4.5                 5        5        0
5                   5.5                 6        9        9
4                   4.5                 4        5        1
6                   5                   8        7        1
3                   3.5                 2        3        1

Note that ties cause the subsequent number to be skipped.

The last column lists the squared rank differences, d_{i}^2. These can be used to calculate r_{s}, which works out to 0.770 – quite close to the Pearson coefficient calculated earlier (0.787). In practical terms, the Spearman coefficient is often considered to be an approximation to the Pearson coefficient.
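As a quick check, the figure of 0.770 can be reproduced directly from the squared rank differences in the table above (this follows the tie-handling convention used there; a library routine such as scipy.stats.spearmanr assigns average ranks to ties and may therefore return a slightly different value):

# Squared rank differences, copied from the last column of the table above
d_squared = [0, 1, 0, 9, 16, 0, 9, 1, 1, 1]
n = 10

r_s = 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))
print(f"{r_s:.3f}")  # should print 0.770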

With that background about the rank correlation, we can now move on to a brief discussion of the Iman-Conover algorithm.

In essence, the Iman-Conover method relies on reordering the set of to-be-correlated variables to have the same rank order as a reference distribution which has the desired correlation. To paraphrase from  Midenhall’s paper (my two cents in italics):

Given two samples of n values from known distributions X and Y (the triangular distributions for A and B in this case) and a desired correlation between them (of 0.78), first determine a sample from a reference distribution that has exactly the desired linear correlation (of 0.78). Then re-order the samples from X and Y to have the same rank order as the reference distribution. The output will be a sample with the correct (individual, triangular) distributions and with rank correlation coefficient equal to that of the reference distribution…. Since linear (Pearson) correlation and rank correlation are typically close, the output has approximately the desired correlation structure…

The idea is beautifully simple, but a problem remains. How does one calculate the required reference distribution?  Unfortunately, this is a fairly technical affair for which I could not find a simple explanation – those interested in a proper, technical discussion of the technique should see Chapter 4 of Midenhall’s paper or the original paper by Iman and Conover.
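To make the re-ordering step a little more concrete, here is a minimal numpy sketch of the idea. It is my own stripped-down illustration, not Iman and Conover's full algorithm: in particular, it simply uses a correlated bivariate-normal sample as the reference distribution and makes no attempt to correct for sampling error in it.

import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
n, rho = 10_000, 0.78

# Independent samples from the two input (triangular) distributions
x = rng.triangular(2, 4, 8, n)
y = rng.triangular(2, 4, 8, n)

# A reference sample with the desired correlation
ref = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)

# Re-order x and y so that each follows the rank order of its reference column.
# The values themselves are untouched, so the triangular marginals are preserved.
ranks = ref.argsort(axis=0).argsort(axis=0)
x_corr = np.sort(x)[ranks[:, 0]]
y_corr = np.sort(y)[ranks[:, 1]]

print(f"achieved correlation: {np.corrcoef(x_corr, y_corr)[0, 1]:.2f}")
# should come out close to, though typically a shade under, the target of 0.78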

For completeness I should note that some folks have criticised the use of  the Iman-Conover algorithm on the grounds that it generates rank correlated random variables instead of  Pearson correlated ones. This is a minor technicality which does not impact the main conclusion of this post:  i.e. that  correlations increase uncertainty.

Written by K

December 10, 2009 at 10:13 pm
