Many outsourcing arrangements fail because customers do not factor in hidden costs. In 2009, I wrote a post on these hard-to-quantify transaction costs. The following short video (4 mins 45 secs) summarises the main points of that post in a (hopefully!) easy-to-understand way:
Note: Here’s the full script, for those who prefer to read instead of watching…
One of the questions that organisations grapple with is whether or not to outsource IT work to external vendors. The work of Oliver Williamson a Nobel Laureate in Economics – provides some insight into this issue. This video is a brief look at how Williamson’s work on transaction cost economics can be applied to the question of outsourcing IT development or implementation.
A firm has two choices for any economic activity: it can either perform the activity in-house or go to market. In either case, the cost of the activity can be decomposed into production costs, which are direct and indirect costs of producing the good or service, and transaction costs, which are costs associated with making the economic exchange (more on this in a minute).
In the case of in-house IT work production costs include salaries, equipment costs etc whereas transaction costs include costs relating to building an IT team (with the right skills, attitude and knowledge).
In the case of outsourced IT work, production costs are similar to those in the in-house case – except that they are now incurred by the vendor and passed on to the client. The point is, these costs are generally known upfront.
The transaction costs, however, are significantly different. They include things such as:
- Search costs: cost of searching for a suitable vendor
- Bargaining costs: effort incurred in agreeing on an acceptable price.
- Enforcement costs: costs of ensuring compliance with the contract
- Costs of coordinating work : this includes costs of managing the vendor.
- Cost of uncertainty: cost associated with unforeseen changes (scope change is a common example)
Now, there are a couple of things to note about transaction costs for outsourcing arrangements:
Firstly, they are typically the client’s problem, not the vendors. Secondly, they can be very hard to figure out upfront. They are the therefore the hidden costs of outsourcing.
According to Williamson, the decision as to whether or not an economic activity should be outsourced depends critically on these hidden transaction costs. In his words, “The most efficient institutional arrangement for carrying out a particular economic activity would be the one that minimized transaction costs.”
The most efficient institutional arrangement for IT development work is often the market, but in-house arrangements are sometimes better.
The potentially million dollar question is: when are in-house arrangements better?
Williamson’s work provides an answer to this question. He argues that the cost of completing an economic transaction in an open market depends on two factors
- Complexity of the transaction – for example, implementing an ERP system is more complex than implementing a new email system.
- Asset specificity – this refers to the degree of customization of the service or product. Highly customized services or products are worth more to the two parties than to anyone else. For example, custom IT services, tailored to the requirements of a specific company have more value client and provider than to anyone else.
In essence, the transaction costs increase with complexity and degree of customization. From this we can conclude that in-house arrangements may be better for work that is complex or highly customized. The reason for this is simple: it is difficult to specify such systems in detail upfront. Consequently, contracts for such work tend to be complex…and worse, they invariably leave out important details.
Such contracts will work only if interpreted in a farsighted manner, with disputes being settled directly between the vendor and client instead of resorting to litigation. When this becomes too hard to do, it makes sense to carry out the activity in-house. Note that this does not mean that it has to be done by internal staff…one can still hire contractors, but it is important ensure that they remain under internal supervision.
If one chooses to outsource such work it is important to ensure that the contract is as unambiguous and transparent as possible. Moreover, both the client and the vendor should expect omissions in contracts, and be flexible whenever there are disagreements over the interpretation of contract terms. In this end, this is possible only if there is a trust-based relationship between the client and vendor…and trust, of course, is impossible to contractualise.
To sum up: be wary of outsourcing work that is complex or highly customized…and if you must, be sure to go with a vendor you trust.
I’ve recently set up a consulting practice specializing in sensemaking and analytics. Most people understand the analytics bit, but many have questions about sensemaking. I got that question so many times that I decided to do a short (2.5 minute) whiteboard video explaining what the term means to me (and my definition is not the same as Wikipedia’s).
Here it is:
For those who prefer the written word, here’s the script (minus the advertising):
“Most organizations are very good at solving problems. This is no surprise: much of training, right from school to university, focuses on teaching us the skills required to solve problems. Now regardless of the specific technique used, the problem-solving process is essentially a logical or analytical one. It goes something like this:
- Gather information about the problem.
- Analyse the information.
- Formulate candidate solutions.
- Implement the solution of choice.
This so-called GAFI method works by breaking problems down into manageable parts, solving each of the parts separately and then assembling these into a solution. The method works very well for most scientific and engineering problems – even one as complicated as sending a spacecraft to Saturn. Indeed, it is so successful that it underpins all of science and modern technology.
However, there is a serious gap in the GAFI method – it assumes that problems are given, it does not tell us how to formulate problems. And as the management luminary, Russell Ackoff once said:
“Outside of school, problems are seldom given; they have to be taken, extracted from complex situations…”
The art of taking problems is what sensemaking is all about.
Unlike analytical thinking, which is purely logical, sensemaking involves such as collaboration, imagination and a healthy tolerance for ambiguity. It is an art that is absolutely essential for surviving…no, thriving, in the increasingly complex world of the 21st century.
The two modes of thinking – sensemaking and analytical – are as different as chalk and cheese but both are necessary for a successful outcome. We like to think of them as lying at opposite ends of a spectrum of thinking styles. When approaching a new situation or problem, one should always begin at the sensemaking end and move towards the analytical end as one understands the problem better. Unfortunately time pressures in corporate environments often force managers and employees into analytical mode without a full appreciation of the problem they are attempting solve. As a result the solutions are often less than optimal. Sensemaking techniques equip organisations with tools that cover the entire problem lifecycle, from definition to solution.”
As a closing remark (that might be construed as advertising…) I’ll mention that I’ve discussed a number of these techniques on Eight to Late. Here are a couple of examples:
Most techniques of predictive analytics have their origins in probability or statistical theory (see my post on Naïve Bayes, for example). In this post I’ll look at one that has more a commonplace origin: the way in which humans make decisions. When making decisions, we typically identify the options available and then evaluate them based on criteria that are important to us. The intuitive appeal of such a procedure is in no small measure due to the fact that it can be easily explained through a visual. Consider the following graphic, for example:
(Original image: https://www.flickr.com/photos/dullhunk/7214525854, Credit: Duncan Hull)
The tree structure depicted here provides a neat, easy-to-follow description of the issue under consideration and its resolution. The decision procedure is based on asking a series of questions, each of which serve to further reduce the domain of possibilities. The predictive technique I discuss in this post,classification and regression trees (CART), works in much the same fashion. It was invented by Leo Breiman and his colleagues in the 1970s.
In what follows, I will use the open source software, R. If you are new to R, you may want to follow this link for more on the basics of setting up and installing it. Note that the R implementation of the CART algorithm is called RPART (Recursive Partitioning And Regression Trees). This is essentially because Breiman and Co. trademarked the term CART. As some others have pointed out, it is somewhat ironical that the algorithm is now commonly referred to as RPART rather than by the term coined by its inventors.
A bit about the algorithm
The rpart algorithm works by splitting the dataset recursively, which means that the subsets that arise from a split are further split until a predetermined termination criterion is reached. At each step, the split is made based on the independent variable that results in the largest possible reduction in heterogeneity of the dependent (predicted) variable.
Splitting rules can be constructed in many different ways, all of which are based on the notion of impurity- a measure of the degree of heterogeneity of the leaf nodes. Put another way, a leaf node that contains a single class is homogeneous and has impurity=0. There are three popular impurity quantification methods: Entropy (aka information gain), Gini Index and Classification Error. Check out this article for a simple explanation of the three methods.
The rpart algorithm offers the entropy and Gini index methods as choices. There is a fair amount of fact and opinion on the Web about which method is better. Here are some of the better articles I’ve come across:
The answer as to which method is the best is: it depends. Given this, it may be prudent to try out a couple of methods and pick the one that works best for your problem.
Regardless of the method chosen, the splitting rules partition the decision space (a fancy word for the entire dataset) into rectangular regions each of which correspond to a split. Consider the following simple example with two predictors x1 and x2. The first split is at x1=1 (which splits the decision space into two regions x1<1 and x1>1), the second at x2=2, which splits the (x1>1) region into 2 sub-regions, and finally x1=1.5 which splits the (x1>1,x2>2) sub-region further.
It is important to note that the algorithm works by making the best possible choice at each particular stage, without any consideration of whether those choices remain optimal in future stages. That is, the algorithm makes a locally optimal decision at each stage. It is thus quite possible that such a choice at one stage turns out to be sub-optimal in the overall scheme of things. In other words, the algorithm does not find a globally optimal tree.
Another important point relates to well-known bias-variance tradeoff in machine learning, which in simple terms is a tradeoff between the degree to which a model fits the training data and its predictive accuracy. This refers to the general rule that beyond a point, it is counterproductive to improve the fit of a model to the training data as this increases the likelihood of overfitting. It is easy to see that deep trees are more likely to overfit the data than shallow ones. One obvious way to control such overfitting is to construct shallower trees by stopping the algorithm at an appropriate point based on whether a split significantly improves the fit. Another is to grow a tree unrestricted and then prune it back using an appropriate criterion. The rpart algorithm takes the latter approach.
Here is how it works in brief:
Essentially one minimises the cost, , a quantity that is a linear combination of the error (essentially, the fraction of misclassified instances, or variance in the case of a continuous variable), and the number of leaf nodes in the tree, :
First, we note that when , this simply returns the original fully grown tree. As increases, we incur a penalty that is proportional to the number of leaf nodes. This tends to cause the minimum cost to occur for a tree that is a subtree of the original one (since a subtree will have a smaller number of leaf nodes). In practice we vary and pick the value that gives the subtree that results in the smallest cross-validated prediction error. One does not have to worry about programming this because the rpart algorithm actually computes the errors for different values of for us. All we need to do is pick the value of the coefficient that gives the lowest cross-validated error. I will illustrate this in detail in the next section.
An implication of their tendency to overfit data is that decision trees tend to be sensitive to relatively minor changes in the training datasets. Indeed, small differences can lead to radically different looking trees. Pruning addresses this to an extent, but does not resolve it completely. A better resolution is offered by the so-called ensemble methods that average over many differently constructed trees. I’ll discuss one such method at length in a future post.
Finally, I should also mention that decision trees can be used for both classification and regression problems (i.e. those in which the predicted variable is discrete and continuous respectively). I’ll demonstrate both types of problems in the next two sections.
Classification trees using rpart
To demonstrate classification trees, we’ll use the Ionosphere dataset available in the mlbench package in R. I have chosen this dataset because it nicely illustrates the points I wish to make in this post. In general, you will almost always find that algorithms that work fine on classroom datasets do not work so well in the real world…but of course, you know that already!
We begin by setting the working directory, loading the required packages (rpart and mlbench) and then loading the Ionosphere dataset.
Next we separate the data into training and test sets. We’ll use the former to build the model and the latter to test it. To do this, I use a simple scheme wherein I randomly select 80% of the data for the training set and assign the remainder to the test data set. This is easily done in a single R statement that invokes the uniform distribution (runif) and the vectorised function, ifelse. Before invoking runif, I set a seed integer to my favourite integer in order to ensure reproducibility of results.
In the above, I have also removed the training flag from the training and test datasets.
Next we invoke rpart. I strongly recommend you take some time to go through the documentation and understand the parameters and their defaults values. Note that we need to remove the predicted variable from the dataset before passing the latter on to the algorithm, which is why we need to find the column index of the predicted variable (first line below). Also note that we set the method parameter to “class“, which simply tells the algorithm that the predicted variable is discrete. Finally, rpart uses Gini rule for splitting by default, and we’ll stick with this option.
The resulting plot is shown in Figure 3 below. It is quite self-explanatory so I won’t dwell on it here.
Next we see how good the model is by seeing how it fares against the test data.
Note that we need to verify the above results by doing multiple runs, each using different training and test sets. I will do this later, after discussing pruning.
Next, we prune the tree using the cost complexity criterion. Basically, the intent is to see if a shallower subtree can give us comparable results. If so, we’d be better of choosing the shallower tree because it reduces the likelihood of overfitting.
As described earlier, we choose the appropriate pruning parameter (aka cost-complexity parameter) by picking the value that results in the lowest prediction error. Note that all relevant computations have already been carried out by R when we built the original tree (the call to rpart in the code above). All that remains now is to pick the value of :
It is clear from the above, that the lowest cross-validation error (xerror in the table) occurs for (this is CP in the table above). One can find CP programatically like so:
Next, we prune the tree based on this value of CP:
Note that rpart will use a default CP value of 0.01 if you don’t specify one in prune.
The pruned tree is shown in Figure 4 below.
Let’s see how this tree stacks up against the fully grown one shown in Fig 3.
This seems like an improvement over the unpruned tree, but one swallow does not a summer make. We need to check that this holds up for different training and test sets. This is easily done by creating multiple random partitions of the dataset and checking the efficacy of pruning for each. To do this efficiently, I’ll create a function that takes the training fraction, number of runs (partitions) and the name of the dataset as inputs and outputs the proportion of correct predictions for each run. It also optionally prunes the tree. Here’s the code:
Note that in the above, I have set the default value of the prune_tree to FALSE, so the function will execute the first branch of the if statement unless the default is overridden.
OK, so let’s do 50 runs with and without pruning, and check the mean and variance of the results for both sets of runs.
So we see that there is an improvement of about 3% with pruning. Also, if you were to plot the trees as we did earlier, you would see that this improvement is achieved with shallower trees. Again, I point out that this is not always the case. In fact, it often happens that pruning results in worse predictions, albeit with better reliability – a classic illustration of the bias-variance tradeoff.
Regression trees using rpart
In the previous section we saw how one can build decision trees for situations in which the predicted variable is discrete. Let’s now look at the case in which the predicted variable is continuous. We’ll use the Boston Housing dataset from the mlbench package. Much of the discussion of the earlier section applies here, so I’ll just display the code, explaining only the differences.
Next we invoke rpart, noting that the predicted variable is medv (median value of owner-occupied homes in $1000 units) and that we need to set the method parameter to “anova“. The latter tells rpart that the predicted variable is continuous (i.e that this is a regression problem).
The plot of the tree is shown in Figure 5 below.
Next, we need to see how good the predictions are. Since the dependent variable is continuous, we cannot compare the predictions directly against the test set. Instead, we calculate the root mean square (RMS) error. To do this, we request rpart to output the predictions as a vector – one prediction per record in the test dataset. The RMS error can then easily be calculated by comparing this vector with the medv column in the test dataset.
Here is the relevant code:
Again, we need to do multiple runs to check on the reliability of the predictions. However, you already know how to do that so I will leave it to you.
Moving on, we prune the tree using the cost complexity criterion as before. The code is exactly the same as in the classification problem.
The tree is unchanged so I won’t show it here. This means, as far as the cost complexity pruning is concerned, the optimal subtree is the same as the original tree. To confirm this, we’d need to do multiple runs as before – something that I’ve already left as as an exercise for you. Basically, you’ll need to write a function analogous to the one above, that computes the root mean square error instead of the proportion of correct predictions.
This brings us to the end of my introduction to classification and regression trees using R. Unlike some articles on the topic I have attempted to describe each of the steps in detail and provide at least some kind of a rationale for them. I hope you’ve found the description and code snippets useful.
I’ll end by reiterating a couple points I made early in this piece. The nice thing about decision trees is that they are easy to explain to the users of our predictions. This is primarily because they reflect the way we think about how decisions are made in real life – via a set of binary choices based on appropriate criteria. That said, in many practical situations decision trees turn out to be unstable: small changes in the dataset can lead to wildly different trees. It turns out that this limitation can be addressed by building a variety of trees using different starting points and then averaging over them. This is the domain of the so-called random forest algorithm.We’ll make the journey from decision trees to random forests in a future post.
Until then, thanks for reading. Your comments & criticisms are welcomed, as always.
An irony of organisational life is that the most important decisions on projects (or any other initiatives) have to be made at the start, when ambiguity is at its highest and information availability lowest. I recently gave a talk at the Pune office of BMC Software on improving decision-making in such situations.
The talk was recorded and simulcast to a couple of locations in India. The folks at BMC very kindly sent me a copy of the recording with permission to publish it on Eight to Late. Here it is:
Based on the questions asked and the feedback received, I reckon that a number of people found the talk useful. I’d welcome your comments/feedback.
Acknowledgements: My thanks go out to Gaurav Pal, Manish Gadgil and Mrinalini Wankhede for giving me the opportunity to speak at BMC, and to Shubhangi Apte for putting me in touch with them. Finally, I’d like to thank the wonderful audience at BMC for their insightful questions and comments.
Enterprise architects are seldom (never?) given a blank canvas on which they can draw as they please. They invariably have to begin with an installed base of systems over which they have no control. As I wrote in a piece on the legacy of legacy systems:
An often unstated (but implicit) requirement [on new systems] is that [they] must maintain continuity between the past and present. This is true even for systems that claim to represent a clean break from the past; one never has the luxury of a completely blank slate, there are always arbitrary constraints placed by legacy systems.
Indeed the system landscape of any large organization is a palimpsest, always retaining traces of what came before. Those who actually maintain systems – usually not architects – are painfully aware of this simple truth.
The IT landscape of an organization is therefore a snapshot, a picture that begins to age the instant is taken. Practicing enterprise architects will say they know this “of course”, and pay due homage to it in their words…but often not their actions. The conflicts and contradictions between legacy and their aspirational architectures are hard to deal with and hence easier to ignore. In this post, I draw a parallel between this central conundrum of enterprise architecture and the process of biological evolution.
A Batesonian perspective on evolution
I’ve recently been re-reading Mind and Nature: A Necessary Unity, a book that Gregory Bateson wrote towards the end of his life of eclectic scholarship. Tucked away in the appendix of the book is an essay lamenting the fragmentation of knowledge and the lack of transdisciplinary thinking within universities. Central to the essay is the notion of obsolescence. Bateson argued that much of what was taught in universities lagged behind the practical skills and mindsets that were needed to tackle the problems of that time. Most people would agree that this is as true today as it was in Bateson’s time, perhaps even more so.
Bateson had a very specific idea of obsolescence in mind. He suggested that the educational system is lopsided because it invariably lags behind what is needed in the “real world”. Specifically, there is a lag between the typical university curriculum and the attitudes, dispositions, knowledge and skills needed to the problems of an ever-changing world. This lag is what Bateson referred to as obsolescence. Indeed, if the external world did not change there would be no lag and hence no obsolescence. As he noted:
I therefore propose to analyze the lopsided process called “obsolescence” which we might more precisely call “one-sided progress.” Clearly for obsolescence to occur there must be, in other parts of the system, other changes compared with which the obsolete is somehow lagging or left behind. In a static system, there would be no obsolescence…
This notion of obsolescence-as-lag has a direct connection with the contrasting process of developmental and evolutionary biology. The process of development of an embryo is inherently conservative – it develops according predetermined rules and is relatively robust to external stimuli. On the other hand, after birth, individuals are continually subject to a wide range of external factors (e.g. climate, stress etc.) that are unpredictable. If exposed to such factors over an extended period, they may change their characteristics in response to them (e.g. the tanning effect of sunlight, adaptability etc). However, these characteristics are not inheritable. They are passed on (if at all) by a much slower process of natural selection. As a consequence, there is a significant lag between external stimuli and the inheritability of the associated characteristics.
As Bateson puts it:
Survival depends upon two contrasting phenomena or processes, two ways of achieving adaptive action. Evolution must always, Janus-like, face in two directions: inward towards the developmental regularities and physiology of the living creature and outward towards the vagaries and demands of the environment. These two necessary components of life contrast in interesting ways: the inner development-the embryology or “epigenesis”-is conservative and demands that every new thing shall conform or be compatible with the regularities of the status quo ante. If we think of a natural selection of new features of anatomy or physiology-then it is clear that one side of this selection process will favor those new items which do not upset the old apple cart. This is minimal necessary conservatism.
In contrast, the outside world is perpetually changing and becoming ready to receive creatures which have undergone change, almost insisting upon change. No animal or plant can ever be “readymade.” The internal recipe insists upon compatibility but is never sufficient for the development and life of the organism. Always the creature itself must achieve change of its own body. It must acquire certain somatic characteristics by use, by disuse, by habit, by hardship, and by nurture. These “acquired characteristics” must, however, never be passed on to the offspring. They must not be directly incorporated into the DNA. In organisational terms, the injunction – e.g. to make babies with strong shoulders who will work better in coal mines- must be transmitted through channels, and the channel in this case is via natural external selection of those offspring who happen (thanks to the random shuffling of genes and random creation of mutations) to have a greater propensity for developing stronger shoulders under the stress of working in coal mine.
The upshot of the above is that the genetic code of any species is inherently obsolete because it is, in at least a few ways, maladapted to its environment. This is a good thing. Sustainable and lasting change to the genome of a population should occur only through the trial-and-error process of natural selection over many generations. It is only through such a gradual process that one can be sure that that a) the adaptation is necessary and b) that it occurs with minimal disruption to the existing order.
…and so to enterprise architecture
In essence, the aim of enterprise architecture is to come up with a strategy and plan to move from an old system landscape to a new one. Typically, architectures are proposed based on current technology trends and extrapolations thereof. Frameworks such as The Open Group Architecture Framework (TOGAF) present a range of options for migrating from legacy architecture.
Here’s an excerpt from Chapter 13 of the TOGAF Guide:
[The objective is to] create an overall Implementation and Migration Strategy that will guide the implementation of the Target Architecture, and structure any Transition Architectures. The first activity is to determine an overall strategic approach to implementing the solutions and/or exploiting opportunities. There are three basic approaches as follows:
- Greenfield: A completely new implementation.
- Revolutionary: A radical change (i.e., switches on, switch off).
- Evolutionary: A strategy of convergence, such as parallel running or a phased approach to introduce new capabilities.
What can we say about these options in light of the discussion of the previous sections?
Firstly, from the discussion of the introduction, it is clear that Greenfield situations can be discounted on grounds rarity alone. So let’s look at the second option – revolutionary change – and ask if it is viable in light of the discussion of the previous section.
In the case of a particular organization, the gap between an old architecture and technology trends/extrapolations is analogous to the lag between inherited characteristics and external forces. The former resist change; the latter insist on it. The discussion of the previous section tells us that the former cannot be wished away, they are a natural consequence of “technology genes” embedded in the organization. Because this is so, changes are best introduced in a gradual way that permits adaptation through the slow and painful process of trial and error. This is why the revolutionary approach usually fails.
It follows from the above that the only viable approach to enterprise architecture is an evolutionary one. This process is necessarily gradual. Architects may wish for green fields and revolutions, but the reality is that lasting and sustainable change in an organisation’s technology landscape can only be achieved incrementally, akin to the way in which an aspiring marathon runner’s physiology adapts to the extreme demands of the sport.
The other, perhaps more subtle point made by this analogy is that a particular organization is but one member of a “species” which, in the present context, is a population of organisations that have a certain technology landscape. Clearly, a new style of architecture will be deemed a success only if it is adopted successfully by a significant number of organisations within this population. Equally clear is that this eventuality is improbable because new architectural paradigms are akin to random mutations. Most of these are rightly rejected by organizations, but only after exacting a high price. This explains why most technology fads tend to fade away.
The analogy between the evolution of biological systems and organizational technology landscapes has some interesting implications for enterprise architects. Here are a few that are worth highlighting:
- Enterprise architects are caught between a rock and a hard place: to demonstrate value they have to change things rapidly, but rapid changes are more likely to fail than succeed.
- The best chance of success lies in an evolutionary approach that accepts trial and error as a natural part of the process. The trick lies in selling that to management…and there are ways to do that.
- A corollary of (2) is that old and new elements of the landscape will necessarily have to coexist, often for periods much longer than one might expect. One must therefore design for coexistence. Above all, the focus here should be on the interfaces for these are the critical elements that enable the old and the new to “talk” to each other.
- Enterprise architects should be skeptical of cutting edge technologies. It almost always better to bet on proven technologies because they have the benefit of the experience of others.
- One of the consequences of an evolutionary process of trial and error is that benefits (or downsides) are often not evident upfront. One must therefore always keep an eye out for these unexpected features.
Finally, it is worth pointing out that an interesting feature of all the above points is that they are consistent with the principles of emergent design.
In this article I’ve attempted to highlight a connection between the evolution of organizational technology landscapes and the process of biological evolution. At the heart of both lie a fundamental tension between inherent conservatism (the tendency to preserve the status quo change) and the imperative to evolve in order to adapt to changes imposed by the environment. There is no question that maintaining the status quo is never an option. The question is how to evolve in order to ensure the best chance of success. Evolution tells us that the best approach is a gradual one, via a process of trial, error and learning.
Graph theory is the an area of mathematics that analyses relationships between pairs of objects. Typically graphs consist of nodes (points representing objects) and edges (lines depicting relationships between objects). As one might imagine, graphs are extremely useful in visualizing relationships between objects. In this post, I provide a detailed introduction to network graphs using R, the premier open source tool statistics package for calculations and the excellent Gephi software for visualization.
The article is organised as follows: I begin by defining the problem and then spend some time developing the concepts used in constructing the graph Following this, I do the data preparation in R and then finally build the network graph using Gephi.
In an introductory article on cluster analysis, I provided an in-depth introduction to a couple of algorithms that can be used to categorise documents automatically. Although these techniques are useful, they do not provide a feel for the relationships between different documents in the collection of interest. In the present piece I show network graphs can be used to to visualise similarity-based relationships within a corpus.
There are many ways to quantify similarity between documents. A popular method is to use the notion of distance between documents. The basic idea is simple: documents that have many words in common are “closer” to each other than those that share fewer words. The problem with distance, however, is that it can be skewed by word count: documents that have an unusually high word count will show up as outliers even though they may be similar (in terms of words used) to other documents in the corpus. For this reason, we will use another related measure of similarity that does not suffer from this problem – more about this in a minute.
Representing documents mathematically
As I explained in my article on cluster analysis, a document can be represented as a point in a conceptual space that has dimensionality equal to the number of distinct words in the collection of documents. I revisit and build on that explanation below.
Say one has a simple document consisting of the words “five plus six”, one can represent it mathematically in a 3 dimensional space in which the individual words are represented by the three axis (See Figure 1). Here each word is a coordinate axis (or dimension). Now, if one connects the point representing the document (point A in the figure) to the origin of the word-space, one has a vector, which in this case is a directed line connecting the point in question to the origin. Specifically, the point A can be represented by the coordinates in this space. This is a nice quantitative representation of the fact that the words five, plus and one appear in the document exactly once. Note, however, that we’ve assumed the order of words does not matter. This is a reasonable assumption in some cases, but not always so.
As another example consider document, B, which consists of only two words: “five plus” (see Fig 2). Clearly this document shares some similarity with document but it is not identical. Indeed, this becomes evident when we note that document (or point) B is simply the point $latex(1, 1, 0)$ in this space, which tells us that it has two coordinates (words/frequencies) in common with document (or point) A.
To be sure, in a realistic collection of documents we would have a large number of distinct words, so we’d have to work in a very high dimensional space. Nevertheless, the same principle holds: every document in the corpus can be represented as a vector consisting of a directed line from the origin to the point to which the document corresponds.
Now it is easy to see that two documents are identical if they correspond to the same point. In other words, if their vectors coincide. On the other hand, if they are completely dissimilar (no words in common), their vectors will be at right angles to each other. What we need, therefore, is a quantity that varies from 0 to 1 depending on whether two documents (vectors) are dissimilar(at right angles to each other) or similar (coincide, or are parallel to each other).
Now here’s the ultra-cool thing, from your high school maths class, you know there is a trigonometric ratio which has exactly this property – the cosine!
What’s even cooler is that the cosine of the angle between two vectors is simply the dot product of the two vectors, which is sum of the products of the individual elements of the vector, divided by the product of the lengths of the two vectors. In three dimensions this can be expressed mathematically as:
where the two vectors are and , and is the angle between the two vectors (see Fig 2).
The upshot of the above is that the cosine of the angle between the vector representation of two documents is a reasonable measure of similarity between them. This quantity, sometimes referred to as cosine similarity, is what we’ll take as our similarity measure in the rest of this article.
The adjacency matrix
If we have a collection of documents, we can calculate the similarity between every pair of documents as we did for A and B in the previous section. This would give us a set of numbers between 0 and 1, which can be conveniently represented as a matrix. This is sometimes called the adjacency matrix. Beware, though, this term has many different meanings in the math literature. I use it in the sense specified above.
Since every document is identical to itself, the diagonal elements of the matrix will all be 1. These similarities are trivial (we know that every document is identical to itself!) so we’ll set the diagonal elements to zero.
Another important practical point is that visualizing every relationship is going to make a very messy graph. There would be edges in such a graph, which would make it impossible to make sense of if we have more than a handful of documents. For this reason, it is normal practice to choose a cutoff value of similarity below which it is set to zero.
Building the adjacency matrix using R
We now have enough background to get down to the main point of this article – visualizing relationships between documents.
The first step is to build the adjacency matrix. In order to do this, we have to build the document term matrix (DTM) for the collection of documents, a process which I have dealt with at length in my introductory pieces on text mining and topic modeling. In fact, the steps are actually identical to those detailed in the second piece. I will therefore avoid lengthy explanations here. However, I’ve listed all the code below with brief comments (for those who are interested in trying this out, the document corpus can be downloaded here and a pdf listing of the R code can be obtained here.)
OK, so here’s the code listing:
docs <- tm_map(docs, toSpace, “-“)
docs <- tm_map(docs, toSpace, “’”)
docs <- tm_map(docs, toSpace, “‘”)
docs <- tm_map(docs, toSpace, “•”)
docs <- tm_map(docs, toSpace, “””)
docs <- tm_map(docs, toSpace, ““”)
pattern = “organiz”, replacement = “organ”)
docs <- tm_map(docs, content_transformer(gsub),
pattern = “organis”, replacement = “organ”)
docs <- tm_map(docs, content_transformer(gsub),
pattern = “andgovern”, replacement = “govern”)
docs <- tm_map(docs, content_transformer(gsub),
pattern = “inenterpris”, replacement = “enterpris”)
docs <- tm_map(docs, content_transformer(gsub),
pattern = “team-“, replacement = “team”)
The rows of a DTM are document vectors akin to the vector representations of documents A and B discussed earlier. The DTM therefore contains all the information we need to calculate the cosine similarity between every pair of documents in the corpus (via equation 1). The R code below implements this, after taking care of a few preliminaries.
A few lines need a brief explanation:
First up, although the DTM is a matrix, it is internally stored in a special form suitable for sparse matrices. We therefore have to explicitly convert it into a proper matrix before using it to calculate similarity.
Second, the names I have given the documents are way too long to use as labels in the network diagram. I have therefore mapped the document names to the row numbers which we’ll use in our network graph later. The mapping back to the original document names is stored in filekey.csv. For future reference, the mapping is shown in Table 1 below.
Table 1: File mappings
Finally, the distance function (as.dist) in the cosine similarity function sets the diagonal elements to zero because the distance between a document and itself is zero…which is just a complicated way of saying that a document is identical to itself
The last three lines of code above simply implement the cutoff that I mentioned in the previous section. The comments explain the details so I need say no more about it.
…which finally brings us to Gephi.
Visualizing document similarity using Gephi
Gephi is an open source, Java based network analysis and visualisation tool. Before going any further, you may want to download and install it. While you’re at it you may also want to download this excellent quick start tutorial.
Go on, I’ll wait for you…
To begin with, there’s a little formatting quirk that we need to deal with. Gephi expects separators in csv files to be semicolons (;) . So, your first step is to open up the adjacency matrix that you created in the previous section (AdjacencyMatrix.csv) in a text editor and replace commas with semicolons.
Once you’ve done that, fire up Gephi, go to File > Open, navigate to where your Adjacency matrix is stored and load the file. If it loads successfully, you should see a feedback panel as shown in Figure 3. By default Gephi creates a directed graph (i.e one in which the edges have arrows pointing from one node to another). Change this to undirected and click OK.
Once that is done, click on overview (top left of the screen). You should end up with something like Figure 4.
Gephi has sketched out an initial network diagram which depicts the relationships between documents…but it needs a bit of work to make it look nicer and more informative. The quickstart tutorial mentioned earlier describes various features that can be used to manipulate and prettify the graph. In the remainder of this section, I list some that I found useful. Gephi offers many more. Do explore, there’s much more than I can cover in an introductory post.
First some basics. You can:
- Zoom and pan using mouse wheel and right button.
- Adjust edge thicknesses using the slider next to text formatting options on bottom left of main panel.
- Re-center graph via the magnifying glass icon on left of display panel (just above size adjuster).
- Toggle node labels on/off by clicking on grey T symbol on bottom left panel.
Figure 5 shows the state of the diagram after labels have been added and edge thickness adjusted (note that your graph may vary in appearance).
The default layout of the graph is ugly and hard to interpret. Let’s work on fixing it up. To do this, go over to the layout panel on the left. Experiment with different layouts to see what they do. After some messing around, I found the Fruchtermann-Reingold and Force Atlas options to be good for this graph. In the end I used Force Atlas with a Repulsion Strength of 2000 (up from the default of 200) and an Attraction Strength of 1 (down from the default of 10). I also adjusted the figure size and node label font size from the graph panel in the center. The result is shown in Figure 6.
This is much better. For example, it is now evident that document 9 is the most connected one (which table 9 tells us is a transcript of a conversation with Neil Preston on organisational change).
It would be nice if we could colour code edges/nodes and size nodes by their degree of connectivity. This can be done via the ranking panel above the layout area where you’ve just been working.
In the Nodes tab select Degree as the rank parameter (this is the degree of connectivity of the node) and hit apply. Select your preferred colours via the small icon just above the colour slider. Use the colour slider to adjust the degree of connectivity at which colour transitions occur.
Do the same for edges, selecting weight as the rank parameter(this is the degree of similarity between the two douments connected by the edge). With a bit of playing around, I got the graph shown in the screenshot below (Figure 7).
If you want to see numerical values for the rankings, hit the results list icon on the bottom left of the ranking panel. You can see numerical ranking values for both nodes and edges as shown in Figures 8 and 9.
It is easy to see from the figure that documents 21 and 29 are the most similar in terms of cosine ranking. This makes sense, they are pieces in which I have ranted about the current state of enterprise architecture – the first article is about EA in general and the other about the TOGAF framework. If you have a quick skim through, you’ll see that they have a fair bit in common.
Finally, it would be nice if we could adjust node size to reflect the connectedness of the associated document. You can do this via the “gem” symbol on the top right of the ranking panel. Select appropriate min and max sizes (I chose defaults) and hit apply. The node size is now reflective of the connectivity of the node – i.e. the number of other documents to which it is cosine similar to varying degrees. The thickness of the edges reflect the degree of similarity. See Figure 10.
Now that looks good enough to export. To do this, hit the preview tab on main panel and make following adjustments to the default settings:
Under Node Labels:
1. Check Show Labels
2. Uncheck proportional size
3. Adjust font to required size
1. Change thickness to 10
2. Check rescale weight
Hit refresh after making the above adjustments. You should get something like Fig 11.
All that remains now is to do the deed: hit export SVG/PDF/PNG to export the diagram. My output is displayed in Figure 12. It clearly shows the relationships between the different documents (nodes) in the corpus. The nodes with the highest connectivity are indicated via node size and colour (purple for high, green for low) and strength of similarity is indicated by edge thickness.
…which brings us to the end of this journey.
The techniques of text analysis enable us to quantify relationships between documents. Document similarity is one such relationship. Numerical measures are good, but the comprehensibility of these can be further enhanced through meaningful visualisations. Indeed, although my stated objective in this article was to provide an introduction to creating network graphs using Gephi and R (which I hope I’ve succeeded in doing), a secondary aim was to show how document similarity can be quantified and visualised. I sincerely hope you’ve found the discussion interesting and useful.
Many thanks for reading! As always, your feedback would be greatly appreciated.
A mainstay of team building workshops is the old “what can we do better” exercise. Over the years I’ve noticed that “improving communication” is an item that comes up again and again in these events. This is frustrating for managers. For example, during a team-building debrief some years ago, an exasperated executive remarked, “Oh don’t pay any attention to that [better communication], it keeps coming up no matter what we do.”
The executive had a point. The organisation had invested much effort in establishing new channels of communication – social media, online, face-to-face forums etc. The uptake, however, was disappointing: turnout at the face-to-face meetings was consistently low as was use of other channels.
As far as management was concerned, they had done their job by establishing communication channels and making them available to all. What more could they be expected to do? The matter was dismissed with a collective shrug of suit-clad shoulders…until the next team building event, when the issue was highlighted by employees yet again.
After much hand-wringing, the organisation embarked on another “better communication cycle.” Much effort was expended…again, with the same disappointing results.
Anecdotal evidence via conversations with friends and collaborators suggests that variants of this story play out in many organisations. This makes the issue well worth exploring. I won’t be so presumptuous as to offer answers; I’m well aware that folks much better qualified than I have spent years attempting to do so. Instead I raise a point which, though often overlooked, might well have something to do with the lack of genuine communication in organisations.
Communication experts have long agreed that face-to-face dialogue is the most effective mode of communication. Backing for this comes from the interactional or pragmatic view, which is based on the premise that communication is more about building relationships than conveying information. Among other things, face-to-face communication enables the communicating parties to observe and interpret non-verbal signals such as facial expression and gestures and, as we all know, these often “say” much more than what’s being said.
A few months ago I started paying closer attention to non-verbal cues. This can be hard to do because people are good at disguising their feelings. Involuntary expressions indicative of people’s real thoughts can be fleeting. A flicker of worry, fear or anger is quickly covered by a mask of indifference.
In meetings, difficult topics tend to be couched in platitudinous language. Platitudes are empty words that sound great but can be interpreted in many different ways. Reconciling those differences often leads to pointless arguments that are emotionally draining. Perhaps this is why people prefer to take refuge in indifference.
A while ago I was sitting in a meeting where the phrase “value add activity” (sic) cropped up once, then again…and then many times over. Soon it was apparent that everyone in the room had a very different conception of what constituted a “value add activity.” Some argued that project management is a value add activity, others disagreed vehemently arguing that project management is a bureaucratic exercise and that real value lies in creating something. Round and round the arguments went but there was no agreement on what constituted a “value add activity.” The discussion generated a lot of heat but shed no light whatsoever on the term.
A problem with communication in the corporate world is that it is loaded with such platitudes. To make sense of these, people have to pay what I call a “value add” tax – the effort in reaching a consensus on what the platitudinous terms mean. This can be emotionally extortionate because platitudes often touch upon issues that affect people’s sense of well-being.
Indifference is easier because we can then pretend to understand and agree with each other when we would rather not understand, let alone agree, at all.