Eight to Late

Sensemaking and Analytics for Organizations


Tackling the John Smith Problem – deduplicating data via fuzzy matching in R


Last week I attended a CRM & data user group meeting for not-for-profits (NFPs), organized by my friend Yael Wasserman from Mission Australia. Following a presentation from a vendor, we broke up into groups and discussed common data quality issues that NFPs (and dare I say most other organisations) face. Number one on the list was the vexing issue of duplicate constituent (donor) records – henceforth referred to as dupes. I like to call this the John Smith Problem, as a typical customer database in a country with a large Anglo population is likely to have a fair number of records for customers with that name. The problem is tricky because one has to identify John Smiths who appear to be distinct in the database but are actually the same person, while also ensuring that one does not inadvertently merge two distinct John Smiths.

The John Smith problem is particularly acute for NFPs as much of their customer data comes in either via manual data entry or bulk loads with less than optimal validation. To be sure, all the NFPs represented at the meeting have some level of validation on both modes of entry, but all participants admitted that dupes tend to sneak in nonetheless…and at volumes that merit serious attention.  Yael and his team have had some success in cracking the dupe problem using SQL-based matching of a combination of fields such as first name, last name and address or first name, last name and phone number and so on. However, as he pointed out, this method is limited because:

  1. It does not allow for typos and misspellings.
  2. Matching on too few fields runs the risk of false positives – i.e. labelling non-dupes as dupes.

The problems arise because SQL-based matching requires one to pre-specify match patterns. The solution is straightforward: use fuzzy matching instead. The idea behind fuzzy matching is simple: allow for inexact matches, assigning each match a similarity score ranging from 0 to 1, with 0 being complete dissimilarity and 1 being a perfect match. My primary objective in this article is to show how one can make headway with the John Smith problem using the fuzzy matching capabilities available in R.

A bit about fuzzy matching

Before getting down to business, it is worth briefly explaining how fuzzy matching works. The basic idea is simple: one has to generalise the notion of a match from a binary “match” / “no match” to allow for partial matching. To do this, we need to introduce the notion of an edit distance, which is essentially the minimum number of operations required to transform one string into another. For example, the edit distance between the strings boy and bay is 1: there’s only one edit required to transform one string into the other. The Levenshtein distance is the most commonly used edit distance. It is essentially, “the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.”

A variant called the Damerau-Levenshtein distance, which additionally allows for the transposition of two adjacent characters (counted as one operation, not two), often proves more useful in practice. We’ll use an implementation of this called the optimal string alignment (osa) distance. If you’re interested in finding out more about osa, check out the Wikipedia article on the Damerau-Levenshtein distance.

Since longer strings will potentially have larger numeric distances between them, it makes sense to normalise the distance to a value lying between 0 and 1. We’ll do this by dividing the calculated osa distance by the length of the longer of the two strings. Yes, this is crude but, as you will see, it works reasonably well. The resulting number is a normalised measure of the dissimilarity between the two strings. To get a similarity measure we simply subtract the dissimilarity from 1. So, a normalised dissimilarity of 1 translates to a similarity score of 0 – i.e. the strings are perfectly dissimilar. I hope I’m not belabouring the point; I just want to make sure it is perfectly clear before going on.
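To make this concrete, here is a minimal sketch using the stringdist package (the same one used later in this article); the two example strings are mine, but the scoring scheme is exactly the one just described:

```r
library(stringdist)

#plain Levenshtein sees the transposed "ho" in "Jhon" as two substitutions...
stringdist("John Smith", "Jhon Smith", method = "lv")  #2
#...while osa counts an adjacent transposition as a single edit
d <- stringdist("John Smith", "Jhon Smith", method = "osa")  #1
#normalise by the length of the longer string, then flip to a similarity
1 - d / max(nchar("John Smith"), nchar("Jhon Smith"))  #0.9
```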

Preparation

In what follows, I assume you have R and RStudio installed. If not, both can be downloaded from the R Project and RStudio websites; installation for both products is usually quite straightforward.

You may also want to download the Excel file many_john_smiths, which contains records for ten fictitious John Smiths. At this point I should affirm that as far as the dataset is concerned, any resemblance to actual John Smiths, living or dead, is purely coincidental! Once you have downloaded the file, you will want to open it in Excel, examine the records, and save it as a csv file in your R working directory (or any other convenient place) for processing in R.

As an aside, if you have access to a database, you may also want to load the file into a table called many_john_smiths and run the following dupe-detecting SQL statement:


select * from many_john_smiths t1
where exists
(select 'x' from many_john_smiths t2
where
t1.FirstName=t2.FirstName
and
t1.LastName=t2.LastName
and
t1.AddressPostcode=t2.AddressPostcode
and
t1.CustomerID <> t2.CustomerID)

You may also want to try matching on other column combinations such as First/Last Name and AddressLine1 or First/Last Name and AddressSuburb for example. The limitations of column-based exact matching will be evident immediately. Indeed,  I have deliberately designed the records to highlight some of the issues associated with dirty data: misspellings, typos, misheard names over the phone etc. A quick perusal of the records will show that there are probably two distinct John Smiths in the list. The problem is to quantify this observation. We do that next.

Tackling the John Smith problem using R

We’ll use the following libraries: stringdist, stringr and rockchalk. The first, stringdist, contains a bunch of string distance functions; we’ll use stringdistmatrix(), which returns a matrix of pairwise string distances (osa by default) when passed a vector of strings. stringr has a number of string utilities, from which we’ll use str_length(), which returns the length of a string. rockchalk contains a useful function, vech2mat(), which casts a half-vector to a triangular matrix (this will make more sense when you see it in action).

OK, so on to the code. The first step is to load the required libraries:

#load libraries
library("stringdist")
library("stringr")
library("rockchalk")

We then read in the data, ensuring that we override the annoying default behaviour of R, which is to convert strings to categorical variables – we want strings to remain strings!

#read data, taking care to ensure that strings remain strings
df <- read.csv("many_john_smiths.csv",stringsAsFactors = F)
#examine dataframe
str(df)

The output from str(df) (not shown) indicates that all columns barring CustomerID are indeed strings (i.e. type=character).

The next step is to find the length of each row:

#find length of string formed by each row (excluding title)
rowlen <- str_length(paste0(df$FirstName,df$LastName,df$AddressLine1,
df$AddressPostcode,df$AddressSuburb,df$Phone))
#examine row lengths
rowlen
> [1] 41 43 39 42 28 41 42 42 42 43

Note that I have excluded the Title column as I did not think it was relevant to determining duplicates.

Next we find the distance between every pair of records in the dataset. We’ll use the stringdistmatrix() function mentioned earlier:


#stringdistmatrix - finds pairwise osa distance between every pair of elements in a
#character vector
d <- stringdistmatrix(paste0(df$FirstName,df$LastName,df$AddressLine1,
df$AddressPostcode,df$AddressSuburb,df$Phone))

stringdistmatrix() returns an object of type dist (distance), so we’ll cast it into a matrix object for use later. We’ll also set the diagonal entries to 0 (since we know that the distance between a string and itself is zero):

#cast d as a proper matrix
d_matrix <- as.matrix(d)
#set diagonals to 0 - distance between a string and itself is 0
diag(d_matrix) <- 0

For reasons that will become clear later, it is convenient to normalise the distance – i.e. scale it to a number that lies between 0 and 1. We’ll do this by dividing the distance between two strings by the length of the longer string. We’ll use the nifty base R function combn() to compute the maximum length for every pair of strings:

#find the length of the longer of two strings in each pair
pwmax <- combn(rowlen,2,max,simplify = T)

The first argument is the vector from which combinations are to be generated, the second is the group size (2, since we want pairs) and the third argument indicates whether or not the result should be returned as an array (simplify=T) or list (simplify=F). The returned object, pwmax, is a one-dimensional array containing the pairwise maximum lengths. We convert this to a proper matrix using the vech2mat() function from rockchalk.
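If combn() is unfamiliar, a toy example may help. For a vector of three lengths, it returns the maximum of each of the three possible pairs, taken in the order (1,2), (1,3), (2,3):

```r
#pairwise maxima of a 3-element vector
combn(c(3, 5, 2), 2, max)
#[1] 5 3 5   i.e. max(3,5), max(3,2), max(5,2)
```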

#convert the resulting vector to a matrix, the diagonal entries should be the
#respective rowlength
pmax_matrix <- vech2mat(pwmax,diag=rowlen)

Now we divide the distance matrix by the maximum length matrix in order to obtain a normalised distance:

#normalise distance
norm_dist_matrix <- d_matrix/pmax_matrix

The normalised distance lies between 0 and 1 (check this!) so we can define similarity as 1-distance:

#similarity = 1 - distance
sim_matrix <-round(1-norm_dist_matrix,2)
sim_matrix
1 2 3 4 5 6 7 8 9 10
1 1.00 0.84 0.76 0.64 0.54 0.46 0.52 0.76 0.55 0.60
2 0.84 1.00 0.70 0.51 0.40 0.51 0.47 0.70 0.49 0.49
3 0.76 0.70 1.00 0.43 0.33 0.32 0.38 0.60 0.55 0.42
4 0.64 0.51 0.43 1.00 0.64 0.71 0.79 0.52 0.50 0.70
5 0.54 0.40 0.33 0.64 1.00 0.56 0.50 0.45 0.43 0.49
6 0.46 0.51 0.32 0.71 0.56 1.00 0.67 0.40 0.31 0.56
7 0.52 0.47 0.38 0.79 0.50 0.67 1.00 0.48 0.45 0.63
8 0.76 0.70 0.60 0.52 0.45 0.40 0.48 1.00 0.48 0.49
9 0.55 0.49 0.55 0.50 0.43 0.31 0.45 0.48 1.00 0.44
10 0.60 0.49 0.42 0.70 0.49 0.56 0.63 0.49 0.44 1.00
#write out the similarity matrix
write.csv(sim_matrix,file="similarity_matrix.csv")

The similarity matrix looks quite reasonable: you can, for example, see that records 1 and 2 (similarity score=0.84) are quite similar while records 1 and 6 are quite dissimilar (similarity score=0.46).  Now let’s extract some results more systematically. We’ll do this by printing out the top 5 non-diagonal similarity scores and the associated records for each of them. This needs a bit of work. To start with, we note that the similarity matrix (like the distance matrix) is symmetric so we’ll convert it into an upper triangular matrix to avoid double counting. We’ll also set the diagonal entries to 0 to avoid comparing a record with itself:

#convert to upper triangular to prevent double counting
sim_matrix[lower.tri(sim_matrix)] <- 0
#set diagonals to zero to avoid comparing row to itself
diag(sim_matrix) <- 0

Next we create a function that returns the n largest similarity scores and their associated row and column number – we’ll need the latter to identify the pair of records that are associated with each score:

#adapted from:
#https://stackoverflow.com/questions/32544566/find-the-largest-values-on-a-matrix-in-r
nlargest <- function(m, n) {
  res <- order(m, decreasing = T)[seq_len(n)]
  pos <- arrayInd(res, dim(m), useNames = TRUE)
  list(values = m[res],
       position = pos)
}

The function takes two arguments: a matrix m and a number n indicating the top n scores to be returned. Let’s set this number to 5 – i.e. we want the top 5 scores and the associated record indexes. We’ll store the output of nlargest in the variable sim_list:

top_n <- 5
sim_list <- nlargest(sim_matrix,top_n)

Finally, we loop through sim_list printing out the scores and associated records as we go along:

for (i in 1:top_n){
  #position is a matrix with row indices in column 1 and column indices in column 2,
  #so position[i] is the row index and position[i+top_n] the column index of the ith score
  rec <- as.character(df[sim_list$position[i],])
  sim_rec <- as.character(df[sim_list$position[i+top_n],])
  cat("score: ",sim_list$values[i],"\n")
  cat("record 1: ",rec,"\n")
  cat("record 2: ",sim_rec,"\n\n")
}
score: 0.84
record 1: 1 John Smith Mr 12 Acadia Rd Burnton 9671 1234 5678
record 2: 2 Jhon Smith Mr 12 Arcadia Road Bernton 967 1233 5678


score: 0.79
record 1: 4 John Smith Mr 13 Kynaston Rd Burnton 9671 34561234
record 2: 7 Jon Smith Mr. 13 Kinaston Rd Barnston 9761 36451223


score: 0.76
record 1: 1 John Smith Mr 12 Acadia Rd Burnton 9671 1234 5678
record 2: 3 J Smith Mr. 12 Acadia Ave Burnton 867`1 1233 567


score: 0.76
record 1: 1 John Smith Mr 12 Acadia Rd Burnton 9671 1234 5678
record 2: 8 John Smith Dr 12 Aracadia St Brenton 9761 12345666


score: 0.71
record 1: 4 John Smith Mr 13 Kynaston Rd Burnton 9671 34561234
record 2: 6 John S Dr. 12 Kinaston Road Bernton 9677 34561223

As you can see, the method correctly identifies close matches: there appear to be two distinct John Smiths in the list (records 1 and 4) – and possibly more, depending on where one sets the similarity threshold. I’ll leave you to explore this further on your own.
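If you’d rather set an explicit threshold than fix the number of matches in advance, a variation like the one below lists every candidate pair. It assumes sim_matrix has already been made upper triangular with a zeroed diagonal as above; the 0.7 cutoff is an arbitrary choice that you should tune to your own data:

```r
#extract all record pairs scoring above a (tunable) similarity threshold
threshold <- 0.7
candidates <- which(sim_matrix >= threshold, arr.ind = TRUE)
for (i in seq_len(nrow(candidates))) {
  cat("records", candidates[i, 1], "and", candidates[i, 2],
      "- score:", sim_matrix[candidates[i, 1], candidates[i, 2]], "\n")
}
```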

The John Smith problem in real life

As a proof of concept, I ran the following SQL on a real CRM database hosted on SQL Server:

select
FirstName+LastName,
count(*)
from
TableName
group by
FirstName+LastName
having
count(*)>100
order by
count(*) desc

I was gratified to note that John Smith did indeed come up tops – well over 200 records. I suspected there were a few duplicates lurking within, so I extracted the records and ran the above R code (with a few minor changes). I found there indeed were some duplicates! I also observed that the code ran with no noticeable degradation despite the dataset having well over 10 times the number of records used in the toy example above. I have not run it for larger datasets yet, but I suspect one will run into memory issues when the number of records gets into the thousands. Nevertheless, based on my experimentation thus far, this method appears viable for small datasets.
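One standard trick for postponing the memory problem is blocking: instead of computing all pairwise distances, compare only records that share a cheap key such as the postcode (or, more robustly, a phonetic encoding of the surname). The sketch below simply illustrates the idea using the column names from the toy example; I have not tested it on a large dataset:

```r
#blocking sketch: compute distance matrices only within postcode groups
#caveat: a typo in the blocking key itself will hide a dupe from this comparison
blocks <- split(df, df$AddressPostcode)
for (b in blocks) {
  if (nrow(b) > 1) {
    s <- paste0(b$FirstName, b$LastName, b$AddressLine1, b$Phone)
    d <- stringdistmatrix(s)  #a small matrix per block instead of one huge one
    #...normalise and score each block's matrix as in the main example
  }
}
```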

The problem of deduplicating large datasets is left as an exercise for motivated readers 😛

Wrapping up

Often organisations will turn to specialist consultancies to fix data quality issues only to find that their work, besides being quite pricey, comes with a lot of caveats and cosmetic fixes that do not address the problem fully.  Given this, there is a case to be made for doing as much of the exploratory groundwork as one can so that one gets a good idea of what can be done and what cannot. At the very least, one will then be able to keep one’s consultants on their toes. In my experience, the John Smith problem ranks right up there in the list of data quality issues that NFPs and many other organisations face. This article is intended as a starting point to address this issue using an easily available and cost effective technology.

Finally, I should reiterate that the approach discussed here is just one of many possible approaches, and is neither optimal nor efficient. Nevertheless, it works quite well on small datasets, and is therefore offered here as a starting point for your own attempts at tackling the problem. If you come up with something better – as I am sure you can – I’d greatly appreciate your letting me know via the contact page on this blog or, better yet, a comment.

Acknowledgements:

I’m indebted to Homan Zhao and Sree Acharath for helpful conversations on fuzzy matching.  I’m also grateful to  all those who attended the NFP CRM and Data User Group meetup that was held earlier this month – the discussions at that meeting inspired this piece.

Written by K

October 9, 2019 at 8:49 pm

Posted in Data Analytics, Data Science, R


Beyond entities and relationships – towards an emergent approach to data modelling


Introduction – some truths about data modelling

It has been said that data is the lifeblood of business. The aptness of this metaphor became apparent to me when I was engaged in a data mapping project some years ago. The objective of that effort was to document all the data flows within the organization. The final map showed very clearly that the volume of data on the move was a good indicator of the activity of the function: the greater the volume, the more active the function. This is akin to the human body, wherein organs that expend the most energy tend to have a richer network of blood vessels.

Although the above analogy is far from perfect, it serves to highlight the simple fact that most business activities involve the movement and/or processing of data. Indeed, the key function of information systems that support business activities is to operate on and transfer data. It therefore matters a great deal how data is represented and stored. This is the main concern of the discipline of data modelling.

The mainstream approach to data modelling assumes that real world objects and relationships can be accurately represented by models.  As an example, a data model representing a sales process might consist of entities such as customers and products and their relationships, such as sales (customer X purchases product Y). It is tacitly assumed that objective, bias-free models of entities and relationships of interest can be built by asking the right questions and using appropriate information collection techniques.

However, things are not quite so straightforward: as professional data modellers know, real-world data models are invariably tainted by compromises between rigour and reality. This is inevitable because the process of building a data model involves at least two different sets of stakeholders whose interests are often at odds – namely, business users and data modelling professionals. The former are less interested in the purity of the model than in the business process that it is intended to support; the interests of the latter, however, are often the opposite.

This reveals a truth about data modelling that is not fully appreciated by practitioners: it is a process of negotiation rather than a search for a true representation of business reality. In other words, it is a socio-technical problem that has wicked elements. As such, data modelling ought to be based on the principles of emergent design. In this post I explore this idea, drawing on a brilliant paper by Heinz Klein and Kalle Lyytinen entitled Towards a New Understanding of Data Modelling, as well as my own thoughts on the subject.

Background

Klein and Lyytinen begin their paper by asking four questions that are aimed at uncovering the tacit assumptions underlying the different approaches to data modelling.  The questions are:

  1. What is being modelled? This question delves into the nature of the “universe” that a data model is intended to represent.
  2. How well is the result represented? This question asks if the language, notations and symbols used to represent the results are fit for purpose – i.e. whether the language and constructs used are capable of modelling the domain.
  3. Is the result valid? This asks whether the model is a correct representation of the domain that is being modelled.
  4. What is the social context in which the discipline operates? This question is aimed at eliciting the views of different stakeholders regarding the model: how they will use it, whether their interests are taken into account and whether they benefit or lose from it.

It should be noted that these questions are general in that they can be used to enquire into any discipline. In the next section we use these questions to uncover the tacit assumptions underlying the mainstream view of data modelling. Following that, we propose an alternate set of assumptions that address a major gap in the mainstream view.

Deconstructing the mainstream view

What is being modelled?

As Klein and Lyytinen put it, the mainstream approach to data modelling assumes that the world is given and made up of concrete objects which have natural properties and are associated with [related to] other objects.   This assumption is rooted in a belief that it is possible to build an objectively true picture of the world around us. This is pretty much how truth is perceived in data modelling: data/information is true or valid if it describes something – a customer, an order or whatever – as it actually is.

In philosophy, such a belief is formalized in the correspondence theory of truth, a term that refers to a family of theories that trace their origins back to antiquity. According to Wikipedia:

Correspondence theories claim that true beliefs and true statements correspond to the actual state of affairs. This type of theory attempts to posit a relationship between thoughts or statements on one hand, and things or facts on the other. It is a traditional model which goes back at least to some of the classical Greek philosophers such as Socrates, Plato, and Aristotle. This class of theories holds that the truth or the falsity of a representation is determined solely by how it relates to a reality; that is, by whether it accurately describes that reality.

In short: the mainstream view of data modelling is based on the belief that the things being modelled have an objective existence.

How well is the result represented?

If data models are to represent reality (as it actually is), then one also needs an appropriate means to express that reality in its entirety. In other words, data models must be complete and consistent in that they represent the entire domain and do not contain any contradictory elements. Although this level of completeness and logical rigour is impossible in practice, much research effort is expended in finding ever more complete and logically consistent notations.

Practitioners have little patience with cumbersome notations invented by theorists, so it is no surprise that the most popular modelling notation is the simplest one: the entity-relationship (ER) approach which was first proposed by Peter Chen. The ER approach assumes that the world can be represented by entities (such as customer) with attributes (such as name), and that entities can be related to each other (for example, a customer might be located at an address – here “is located at” is a relationship between the customer and address entities).  Most commercial data modelling tools support this notation (and its extensions) in one form or another.

To summarise: despite the fact that the most widely used modelling notation is not based on rigorous theory, practitioners generally assume that the ER notation is an appropriate vehicle to represent  what is going on in the domain of interest.

Is the result valid?

As argued above, the mainstream approach to data modelling assumes that the world of interest has an objective existence and can be represented by a simple notation that depicts entities of interest and the relationships between them.  This leads to the question of the validity of the models thus built. To answer this we have to understand how data models are constructed.

The process of model-building involves observation, information gathering and analysis – that is, it is akin to the approach used in scientific enquiry. A great deal of attention is paid to model verification, and this is usually done via interaction with subject matter experts, users and business analysts.  To be sure, the initial model is generally incomplete, but it is assumed that it can be iteratively refined to incorporate newly surfaced facts and fix errors.  The underlying belief is that such a process gets ever-closer to the truth.

In short: it is assumed that an ER model built using a systematic and iterative process of enquiry will be a valid representation of the domain of interest.

What is the social context in which the discipline operates?

From the above, one might get the impression that data modelling involves a lot of user interaction. Although this is generally true, it is important to note that the users’ roles are restricted to providing information to data modellers. The modellers then interpret the information provided by users and cast it into a model.

This brings up an important socio-political implication of the mainstream approach: data models generally support business applications that are aimed at maintaining and enhancing managerial control through automation and / or improved oversight. Underlying this is the belief that a properly constructed data model (i.e. one that accurately represents reality) can enhance business efficiency and effectiveness within the domain represented by the model.

In brief: data models are built to further the interests of specific stakeholder groups within an organization.

Summarising the mainstream view

The detailed responses to the questions above reveal that the discipline of data modelling is based on the following assumptions:

  1. The domain of interest has an objective existence.
  2. The domain can be represented using a (more or less) logical language.
  3. The language can represent the domain of interest accurately.
  4. The resulting model is based largely on a philosophy of managerial control, and can be used to drive organizational efficiency and effectiveness.

Many (most?) data management professionals will see these assumptions as being uncontroversial. However, as we shall see next, things are not quite so simple…

Motivating an alternate view of data modelling

In an earlier section I mentioned the correspondence theory of truth, which tells us that true statements are those that correspond to the actual state of affairs in the real world. A problem with correspondence theories is that they assume that: a) there is an objective reality, and b) it is perceived in the same way by everyone. This assumption is problematic, especially for issues that have a social dimension. Such issues are perceived differently by different stakeholders, each of whom will seek data that supports their point of view. The problem is that there is no way to determine which data is “objectively right.” More to the point, in such situations the very notion of “objective rightness” can be legitimately questioned.

Another issue with correspondence theories is that a piece of data can at best be an abstraction of a real-world object or event.  This is a serious problem with correspondence theories in the context of business intelligence. For example, when a sales rep records a customer call, he or she notes down only what is required by the CRM system. Other data that may well be more important is not captured or is relegated to a “Notes” or “Comments” field that is rarely if ever searched or accessed.

Another perspective on truth is offered by the so-called consensus theory, which asserts that true statements are those that are agreed to by the relevant group of people. This is often the way “truth” is established in organisations. For example, managers may choose to calculate KPIs using certain pieces of data that are deemed to be true. The problem with this is that consensus can be achieved by means that are not necessarily democratic. For example, a KPI definition chosen by a manager may be contested by an employee. Nevertheless, the employee has to accept it because organisations are not democracies. A more significant issue is that the notion of “relevant group” is problematic because there is no clear criterion by which to define relevance. Quite naturally, this leads to conflict and ill-will.

This conclusion leads one to formulate alternative answers to the four questions posed above, thus paving the way to a new approach to data modelling.

An alternate view of data management

What is being modelled?

The discussion of the previous section suggests that data models cannot represent an objective reality because there is never any guarantee that all interested parties will agree on what that reality is.  Indeed, insofar as data models are concerned, it is more useful to view reality as being socially constructed – i.e. collectively built by all those who have a stake in it.

How is reality socially constructed? Basically it is through a process of communication in which individuals discuss their own viewpoints and agree on how differences (if any) can be reconciled. The authors note that:

…the design of an information system is not a question of fitness for an organizational reality that can be modelled beforehand, but a question of fitness for use in the construction of a [collective] organizational reality…

This is more in line with the consensus theory of truth than the correspondence theory.

In brief: the reality that data models are required to represent is socially constructed.

How well is the result represented?

Given the above, it is clear that any data model ought to be subjected to validation by all stakeholders so that they can check that it actually does represent their viewpoint. This can be difficult to achieve because most stakeholders do not have the time or inclination to validate data models in detail.

In view of the above, it is clear that the ER notation and others of its ilk can represent a truth rather than the truth – that is, they are capable of representing the world according to a particular set of stakeholders (managers or users, for example). Indeed, a data model (in whatever notation) can be thought of as one possible representation of a domain. The point is, there are as many possible representations as there are stakeholder groups; in mainstream data modelling, one of these representations “wins” while the others are completely ignored. Indeed, the alternate views generally remain undocumented, so they are invariably forgotten. This suggests that a key step in data modelling should be to capture all possible viewpoints on the domain of interest in a way that makes a sensible comparison possible. Apart from helping the group reach a consensus, such documentation is invaluable in explaining to future users and designers why the data model is the way it is. Regular readers of this blog will no doubt see that the IBIS notation and dialogue mapping could be hugely helpful in this process. It would take me too far afield to explore this point here, but I will do so in a future post.

In brief:  notations used by mainstream data modellers cannot capture multiple worldviews in a systematic and useful way. These conventional data modelling languages need to be augmented by notations that are capable of incorporating diverse viewpoints.

Is the result valid?

The above discussion begs a question, though: what if two stakeholders disagree on a particular point?

One answer to this lies in Juergen Habermas’ theory of communicative rationality, which I have discussed in detail in this post. I’ll explain the basic idea via the following excerpt from that post:

When we participate in discussions we want our views to be taken seriously. Consequently, we present our views through statements that we hope others will see as being rational – i.e. based on sound premises and logical thought. One presumes that when someone makes a claim, he or she is willing to present arguments that will convince others of the reasonableness of the claim. Others will judge the claim based on the validity of the statements the claimant makes. When the validity claims are contested, debate ensues with the aim of getting to some kind of agreement.

The philosophy underlying such a process of discourse (which is simply another word for “debate” or “dialogue”)  is described in the theory of communicative rationality proposed by the German philosopher Jurgen Habermas.  The basic premise of communicative rationality is that rationality (or reason) is tied to social interactions and dialogue. In other words, the exercise of reason can  occur only through dialogue.  Such communication, or mutual deliberation,  ought to result in a general agreement about the issues under discussion.  Only once such agreement is achieved can there be a consensus on actions that need to be taken.  Habermas refers to the  latter  as communicative action,  i.e.  action resulting from collective deliberation…

In brief: validity is not an objective matter but a subjective – or rather an intersubjective – one. That is, validity has to be agreed upon by all parties affected by the claim.

What is the social context in which the discipline operates?

From the above it should be clear that the alternate view of data management is radically different from the mainstream approach. The difference is particularly apparent when one looks at the way in which the alternate approach views different stakeholder groups. Recall that in the mainstream view, managerial perspectives take precedence over all others because the overriding aim of data modelling (as indeed most enterprise IT activities) is control. Yes, I am aware that it is supposed to be about enablement, but the question is enablement for whom? In most cases, the answer is that it enables managers to control. In contrast, from the above we see that the reality depicted in a data (or any other) model is socially constructed – that is, it is based on a consensus arising from debates on the spectrum of views that people hold. Moreover, no claim has precedence over others by virtue of authority. Different interpretations of the world have to be fused together in order to build a consensually accepted world.

The social aspect is further muddied by conflicts between managers on matters of data ownership, interpretation and access. Typically, however, such matters lie outside the purview of data modellers.

In brief: the social context in which the discipline operates is that there are a wide variety of stakeholder groups, each of which may hold different views. These must be debated and reconciled.

Summarising the alternate view

The detailed responses to the questions above reveal that the alternate view of data modelling is based on the following assumptions:

  1. The domain of interest is socially constructed.
  2. The standard representations of data models are inadequate because they cannot represent multiple viewpoints. They can (and should) be supplemented by notations that can document multiple viewpoints.
  3. A valid data model is constructed through an iterative synthesis of multiple viewpoints.
  4. The resulting model is based on a shared understanding of the socially constructed domain.

Clearly these assumptions are diametrically opposed to those in the mainstream. Let's briefly discuss their implications for the profession.

Discussion

The most important implication of the alternate view is that a data model is but one interpretation of reality. Since there are many possible interpretations of reality, the "correctness" of any particular model hinges not on some objective truth but on an agreed, best-for-group interpretation. A consequence of the above is that well-constructed data models "fuse" or "bring together" at least two different interpretations – those of users and modellers. Typically there are many different groups of users, each with their own interpretation. This being the case, it is clear that the onus lies on modellers to reconcile any differences between these groups, as they are the ones responsible for creating models.

A further implication of the above is that it is impossible to build a consistent enterprise-wide data model.  That said, it is possible to have a high-level strategic data model that consists, say, of entities but lacks detailed attribute-level information. Such a model can be useful because it provides a starting point for a dialogue between user groups and also serves to remind modellers of the entities they may need to consider when building a detailed data model.

The mainstream view asserts that data is gathered to establish the truth. The alternate view, however, makes us aware that data models are built in such a way as to support particular agendas. Moreover, since the people who use the data are not those who collect or record it, a gap between assumed and actual meaning is inevitable.  Once again this emphasises the fact that the meaning of a particular piece of data is very much in the eye of the beholder.

In closing

The mainstream approach to data modelling reflects the general belief that the methods of natural sciences can be successfully applied in the area of systems development.  Although this is a good assumption for theoretical computer science, which deals with constructs such as data structures and algorithms, it is highly questionable when it comes to systems development in large organisations. In the latter case social factors dominate, and these tend to escape any logical system. This simple fact remains under-appreciated by the data modelling community and, for that matter, much of the universe of corporate IT.

The alternate view described in this post draws attention to the social and political aspects of data modelling. Although IT vendors and executives tend to give these issues short shrift, the chronically high failure rate of   data-centric initiatives (such as those aimed at building enterprise data warehouses) warns us to pause and think.  If there is anything at all to be learnt from these failures, it is that data modelling is not a science that deals with facts and models, but an art that has more to do with negotiation and law-making.

Written by K

January 6, 2015 at 9:56 pm

A data warehousing tragedy in five limericks

with 2 comments

It started with a presentation,
a proforma  regurgitation:
a tired old story,
of a repository
for all data in an organization.

The business was duly seduced
by promises of costs reduced.
But the data warehouse,
so glibly espoused,
was not so simply produced.

For the team was soon in distress,
‘cos the data landscape was a mess:
data duplication,
dodgy information
in databases and files countless.

And politics had them bogged down;
in circles they went round  and round.
Logic paralysed,
totally traumatised,
in a sea of data they drowned.

In the light of the following morn,
the truth upon them did dawn.
An enterprise data store
is IT lore
as elusive as the unicorn.

Written by K

December 20, 2012 at 7:02 pm

Out damn’d SPOT: an essay on data, information and truth in organisations

with 4 comments

Introduction

Jack: My report tells me that we are on track to make budget this year.

Jill: That’s strange, my report tells me otherwise

Jack: That can’t be. Have you used the right filters?

Jill: Yes – the one’s you sent me yesterday.

Jack: There must be something else…my figures must be right, they come from the ERP system.

Jill: Oh, that must be it then…mine are from the reporting system.

Conversations such as the one above occur quite often in organisation-land.  It is one of the reasons why organisations chase the holy grail of a single point of truth (SPOT): an organisation-wide repository that holds the officially endorsed true version of data, regardless of where it originates. Such a repository is often known as an Enterprise Data Warehouse (EDW).

Like all holy grails, however, the EDW is a mythical object that exists only in the pages of textbooks (and vendor brochures…). It is at best an ideal to strive towards. But, like chasing the end of a rainbow, it is an exercise that may prove exhausting and, ultimately, futile.

Regardless of whether or not organisations can get to that mythical end of the rainbow – and there are those who claim to have got there – there is a deeper issue with the standard view of data and information that holds sway in organisation-land.   In this post I examine these standard conceptions of data, information and truth, drawing largely on this paper by Bernd Carsten Stahl and a number of secondary sources.

Some truths about data and information

As Stahl observes in his introduction:

Many assume that information is central to managerial decision making and that more and higher quality information will lead to better outcomes. This assumption persists even though Russell Ackoff argued over 40 years ago that it is misleading

The reason for the remarkable persistence of this incorrect assumption is that there is a lack of clarity as to what data and information actually are.

To begin with let’s take a look at what these terms mean in the sense in which they are commonly used in organisations. Data typically refers to raw, unprocessed facts or the results of measurements. Information is data that is imbued with meaning and relevance because it is referred to in a context of interest. For example, a piece of numerical data by itself has no meaning – it is just a number. However, its meaning becomes clear once we are provided a context – for example, that the number is the price of a particular product.

The above seems straightforward enough and embodies the standard view of data and information in organisations. However, a closer look reveals some serious problems. For example, what we call raw data is not unprocessed – the data collector always makes a choice as to what data will be collected and what will not. So in this sense, data already has meaning imposed on it. Further, there is no guarantee that what has been excluded is irrelevant. As another example, decision makers will often use data (relevant or not) just because it is available. This is a particularly common practice when defining business KPIs – people often use data that can be obtained easily rather than attempting to measure metrics that are relevant.

Four perspectives on truth

One of the tacit assumptions that managers make about the information available to them is that it is true.  But what exactly does this mean?  Let’s answer this question by taking a whirlwind tour of some theories of truth.

The most commonly accepted notion of truth is that of correspondence, that a statement is true if it describes something as it actually is.  This is pretty much how truth is perceived in business intelligence: data/information is true or valid if it describes something – a customer, an order or whatever – as it actually is.

More generally, the term correspondence theory of truth refers to a family of theories that trace their origins back to antiquity. According to Wikipedia:

Correspondence theories claim that true beliefs and true statements correspond to the actual state of affairs. This type of theory attempts to posit a relationship between thoughts or statements on one hand, and things or facts on the other. It is a traditional model which goes back at least to some of the classical Greek philosophers such as Socrates, Plato, and Aristotle. This class of theories holds that the truth or the falsity of a representation is determined solely by how it relates to a reality; that is, by whether it accurately describes that reality.

One of the problems with correspondence theories is that they require the existence of an objective reality that can be perceived in the same way by everyone. This assumption is clearly problematic, especially for issues that have a social dimension. Such issues are perceived differently by different stakeholders, and each of these will legitimately seek data that supports their point of view. The problem is that there is often no way to determine which data is “objectively right.” More to the point, in such situations the very notion of “objective rightness” can be legitimately questioned.

Another issue with correspondence theories is that a piece of data can at best be an abstraction of a real-world object or event. This is particularly problematic in the context of organisational data. For example, when a sales rep records a customer call, he or she notes down only what is required by the customer management system. Other data that may well be more important is not captured, or is relegated to a "Notes" or "Comments" field that is rarely if ever searched or accessed.

Another perspective is offered by the so-called consensus theories of truth, which assert that true statements are those that are agreed to by the relevant group of people. This is often the way truth is established in organisations. For example, managers may choose to calculate Key Performance Indicators (KPIs) using certain pieces of data that are deemed to be true.  The problem with this is that consensus can be achieved by means that are not necessarily democratic. For example, a KPI definition chosen by a manager may be hotly contested by an employee.  Nevertheless, the employee has to accept it because organisations are typically not democratic. A more significant issue is that the notion of "relevant group" is problematic because there is no clear criterion by which to define relevance.

Pragmatic theories of truth assert that truth is a function of utility – i.e. a statement is true if it is useful to believe it is so. In other words, the truth of a statement is to be judged by the payoff obtained by believing it to be true.  One of the problems with these theories is that it may be useful for some people to believe in a particular statement while it is useful for others to disbelieve it. A good example of such a statement is: there is an objective reality. Scientists may find it useful to believe this whereas social constructionists may not. Closer to home, it may be useful for a manager to believe that a particular customer is a good prospect (based on market intelligence, say), but a sales rep who knows the customer is unlikely to switch brands may think it useful to believe otherwise.

Finally, coherence theories of truth tell us that statements that are true must be consistent with a wider set of beliefs. In organisational terms, a piece of information or data is true only if it does not contradict things that others in the organisation believe to be true. Coherence theories emphasise that the truth of statements cannot be established in isolation but must be evaluated as part of a larger system of statements (or beliefs). For example, managers may believe certain KPIs to be true because they fit in with other things they know about their business.

…And so to conclude

The truth is a slippery beast: what is true and what is not depends on what exactly one means by the truth and, as we have seen, there are several different conceptions of truth.

One may well ask if this matters from a practical point of view.  To put it plainly: should executives, middle managers and frontline employees (not to mention business intelligence analysts and data warehouse designers) worry about philosophical theories of truth?  My contention is that they should, if only to understand that the criteria they use for determining the validity of their data and information are little more than conventions that are easily overturned by taking other, equally legitimate, points of view.

Written by K

October 17, 2012 at 9:11 pm
