Beyond entities and relationships – towards an emergent approach to data modelling
Introduction – some truths about data modelling
It has been said that data is the lifeblood of business. The aptness of this metaphor became apparent to me when I was engaged in a data mapping project some years ago. The objective of that effort was to document all the data flows within the organization. The final map showed very clearly that the volume of data on the move was a good indicator of the activity of the function: the greater the volume, the more active the function. This is akin to the case of the human body, wherein organs that expend more energy tend to have a richer network of blood vessels.
Although the above analogy is far from perfect, it serves to highlight the simple fact that most business activities involve the movement and/or processing of data. Indeed, the key function of information systems that support business activities is to operate on and transfer data. It therefore matters a great deal how data is represented and stored. This is the main concern of the discipline of data modelling.
The mainstream approach to data modelling assumes that real world objects and relationships can be accurately represented by models. As an example, a data model representing a sales process might consist of entities such as customers and products and their relationships, such as sales (customer X purchases product Y). It is tacitly assumed that objective, bias-free models of entities and relationships of interest can be built by asking the right questions and using appropriate information collection techniques.
However, things are not quite so straightforward: as professional data modellers know, real-world data models are invariably tainted by compromises between rigour and reality. This is inevitable because the process of building a data model involves at least two different sets of stakeholders whose interests are often at odds – namely, business users and data modelling professionals. The former are less interested in the purity of the model than in the business process that it is intended to support; the interests of the latter, however, are often the opposite.
This reveals a truth about data modelling that is not fully appreciated by practitioners: that it is a process of negotiation rather than a search for a true representation of business reality. In other words, it is a socio-technical problem that has wicked elements. As such, data modelling ought to be based on the principles of emergent design. In this post I explore this idea, drawing on a brilliant paper by Heinz Klein and Kalle Lyytinen entitled Towards a New Understanding of Data Modelling, as well as my own thoughts on the subject.
Klein and Lyytinen begin their paper by asking four questions that are aimed at uncovering the tacit assumptions underlying the different approaches to data modelling. The questions are:
- What is being modelled? This question delves into the nature of the “universe” that a data model is intended to represent.
- How well is the result represented? This question asks if the language, notations and symbols used to represent the results are fit for purpose – i.e. whether the language and constructs used are capable of modelling the domain.
- Is the result valid? This question asks whether the model is a correct representation of the domain that is being modelled.
- What is the social context in which the discipline operates? This question is aimed at eliciting the views of different stakeholders regarding the model: how they will use it, whether their interests are taken into account and whether they benefit or lose from it.
It should be noted that these questions are general in that they can be used to enquire into any discipline. In the next section we use these questions to uncover the tacit assumptions underlying the mainstream view of data modelling. Following that, we propose an alternate set of assumptions that address a major gap in the mainstream view.
Deconstructing the mainstream view
What is being modelled?
As Klein and Lyytinen put it, the mainstream approach to data modelling assumes that the world is given and made up of concrete objects which have natural properties and are associated with [related to] other objects. This assumption is rooted in a belief that it is possible to build an objectively true picture of the world around us. This is pretty much how truth is perceived in data modelling: data/information is true or valid if it describes something – a customer, an order or whatever – as it actually is.
In philosophy, such a belief is formalized in the correspondence theory of truth, a term that refers to a family of theories that trace their origins back to antiquity. According to Wikipedia:
Correspondence theories claim that true beliefs and true statements correspond to the actual state of affairs. This type of theory attempts to posit a relationship between thoughts or statements on one hand, and things or facts on the other. It is a traditional model which goes back at least to some of the classical Greek philosophers such as Socrates, Plato, and Aristotle. This class of theories holds that the truth or the falsity of a representation is determined solely by how it relates to a reality; that is, by whether it accurately describes that reality.
In short: the mainstream view of data modelling is based on the belief that the things being modelled have an objective existence.
How well is the result represented?
If data models are to represent reality (as it actually is), then one also needs an appropriate means to express that reality in its entirety. In other words, data models must be complete and consistent in that they represent the entire domain and do not contain any contradictory elements. Although this level of completeness and logical rigour is impossible in practice, much research effort is expended in finding ever more complete and logically consistent notations.
Practitioners have little patience with cumbersome notations invented by theorists, so it is no surprise that the most popular modelling notation is the simplest one: the entity-relationship (ER) approach which was first proposed by Peter Chen. The ER approach assumes that the world can be represented by entities (such as customer) with attributes (such as name), and that entities can be related to each other (for example, a customer might be located at an address – here “is located at” is a relationship between the customer and address entities). Most commercial data modelling tools support this notation (and its extensions) in one form or another.
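As a rough illustration of the ER idea described above, here is a minimal sketch in Python. The class and field names (Customer, Address, LocatedAt), the identifiers and the sample data are my own illustrative assumptions; they are not drawn from Chen's work or from any particular modelling tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Customer:
    """Entity: a customer, with 'name' as an attribute."""
    customer_id: int
    name: str


@dataclass(frozen=True)
class Address:
    """Entity: an address, with street and city as attributes."""
    address_id: int
    street: str
    city: str


@dataclass(frozen=True)
class LocatedAt:
    """Relationship: links one customer to one address."""
    customer_id: int
    address_id: int


# Hypothetical sample data, invented for illustration only.
customers = {1: Customer(1, "Acme Pty Ltd")}
addresses = {10: Address(10, "1 Example Street", "Sydney")}
located_at = [LocatedAt(customer_id=1, address_id=10)]


def addresses_of(customer_id: int) -> List[Address]:
    """Follow the 'is located at' relationship from a customer to its addresses."""
    return [
        addresses[rel.address_id]
        for rel in located_at
        if rel.customer_id == customer_id
    ]
```

Even a sketch this small embeds modelling decisions, such as whether an address is an entity in its own right or merely an attribute of the customer.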
To summarise: despite the fact that the most widely used modelling notation is not based on rigorous theory, practitioners generally assume that the ER notation is an appropriate vehicle to represent what is going on in the domain of interest.
Is the result valid?
As argued above, the mainstream approach to data modelling assumes that the world of interest has an objective existence and can be represented by a simple notation that depicts entities of interest and the relationships between them. This leads to the question of the validity of the models thus built. To answer this we have to understand how data models are constructed.
The process of model-building involves observation, information gathering and analysis – that is, it is akin to the approach used in scientific enquiry. A great deal of attention is paid to model verification, and this is usually done via interaction with subject matter experts, users and business analysts. To be sure, the initial model is generally incomplete, but it is assumed that it can be iteratively refined to incorporate newly surfaced facts and fix errors. The underlying belief is that such a process gets ever-closer to the truth.
In short: it is assumed that an ER model built using a systematic and iterative process of enquiry will be a valid representation of the domain of interest.
What is the social context in which the discipline operates?
From the above, one might get the impression that data modelling involves a lot of user interaction. Although this is generally true, it is important to note that the users’ roles are restricted to providing information to data modellers. The modellers then interpret the information provided by users and cast it into a model.
This brings up an important socio-political implication of the mainstream approach: data models generally support business applications that are aimed at maintaining and enhancing managerial control through automation and/or improved oversight. Underlying this is the belief that a properly constructed data model (i.e. one that accurately represents reality) can enhance business efficiency and effectiveness within the domain represented by the model.
In brief: data models are built to further the interests of specific stakeholder groups within an organization.
Summarising the mainstream view
The detailed responses to the questions above reveal that the discipline of data modelling is based on the following assumptions:
- The domain of interest has an objective existence.
- The domain can be represented using a (more or less) logical language.
- The language can represent the domain of interest accurately.
- The resulting model is based largely on a philosophy of managerial control, and can be used to drive organizational efficiency and effectiveness.
Many (most?) data management professionals will see these assumptions as being uncontroversial. However, as we shall see next, things are not quite so simple…
Motivating an alternate view of data modelling
In an earlier section I mentioned the correspondence theory of truth, which tells us that true statements are those that correspond to the actual state of affairs in the real world. A problem with correspondence theories is that they assume that: a) there is an objective reality, and b) it is perceived in the same way by everyone. These assumptions are problematic, especially for issues that have a social dimension. Such issues are perceived differently by different stakeholders, each of whom will seek data that supports their point of view. The problem is that there is no way to determine which data is “objectively right.” More to the point, in such situations the very notion of “objective rightness” can be legitimately questioned.
Another issue with correspondence theories is that a piece of data can at best be an abstraction of a real-world object or event. This is a serious problem with correspondence theories in the context of business intelligence. For example, when a sales rep records a customer call, he or she notes down only what is required by the CRM system. Other data that may well be more important is not captured or is relegated to a “Notes” or “Comments” field that is rarely if ever searched or accessed.
Another perspective on truth is offered by the so-called consensus theory, which asserts that true statements are those that are agreed to by the relevant group of people. This is often the way “truth” is established in organisations. For example, managers may choose to calculate KPIs using certain pieces of data that are deemed to be true. The problem with this is that consensus can be achieved by means that are not necessarily democratic. For example, a KPI definition chosen by a manager may be contested by an employee. Nevertheless, the employee has to accept it because organisations are not democracies. A more significant issue is that the notion of “relevant group” is problematic because there is no clear criterion by which to define relevance. Quite naturally this leads to conflict and ill-will.
This conclusion leads one to formulate alternative answers to the four questions posed above, thus paving the way to a new approach to data modelling.
An alternate view of data management
What is being modelled?
The discussion of the previous section suggests that data models cannot represent an objective reality because there is never any guarantee that all interested parties will agree on what that reality is. Indeed, insofar as data models are concerned, it is more useful to view reality as being socially constructed – i.e. collectively built by all those who have a stake in it.
How is reality socially constructed? Basically it is through a process of communication in which individuals discuss their own viewpoints and agree on how differences (if any) can be reconciled. The authors note that:
…the design of an information system is not a question of fitness for an organizational reality that can be modelled beforehand, but a question of fitness for use in the construction of a [collective] organizational reality…
This is more in line with the consensus theory of truth than the correspondence theory.
In brief: the reality that data models are required to represent is socially constructed.
How well is the result represented?
Given the above, it is clear that any data model ought to be subjected to validation by all stakeholders so that they can check that it actually does represent their viewpoint. This can be difficult to achieve because most stakeholders do not have the time or inclination to validate data models in detail.
In view of the above, it is clear that the ER notation and others of its ilk can represent a truth rather than the truth – that is, they are capable of representing the world according to a particular set of stakeholders (managers or users, for example). Indeed, a data model (in whatever notation) can be thought of as one possible representation of a domain. The point is, there are as many representations possible as there are stakeholder groups, and in mainstream data modelling one of these representations “wins” while the others are completely ignored. Indeed, the alternate views generally remain undocumented so they are invariably forgotten. This suggests that a key step in data modelling would be to capture all possible viewpoints on the domain of interest in a way that makes a sensible comparison possible. Apart from helping the group reach a consensus, such documentation is invaluable to explain to future users and designers why the data model is the way it is. Regular readers of this blog will no doubt see that the IBIS notation and dialogue mapping could be hugely helpful in this process. It would take me too far afield to explore this point here, but I will do so in a future post.
In brief: notations used by mainstream data modellers cannot capture multiple worldviews in a systematic and useful way. These conventional data modelling languages need to be augmented by notations that are capable of incorporating diverse viewpoints.
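To make the idea of documenting multiple viewpoints a little more concrete, here is a minimal sketch in Python. It is not IBIS or any established notation: the Viewpoint and ConceptRecord structures, the stakeholder names and the competing definitions of “customer” are all hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Viewpoint:
    """One stakeholder group's definition of a concept (hypothetical structure)."""
    stakeholder: str
    definition: str
    rationale: str


@dataclass
class ConceptRecord:
    """A concept in the model, with every recorded viewpoint kept alongside it."""
    name: str
    viewpoints: List[Viewpoint] = field(default_factory=list)
    agreed: Optional[Viewpoint] = None  # set only once the group reaches agreement

    def add_viewpoint(self, viewpoint: Viewpoint) -> None:
        self.viewpoints.append(viewpoint)

    def agree_on(self, stakeholder: str) -> None:
        """Mark one viewpoint as the agreed one; the others stay on record."""
        for viewpoint in self.viewpoints:
            if viewpoint.stakeholder == stakeholder:
                self.agreed = viewpoint
                return
        raise ValueError(f"no viewpoint recorded for {stakeholder!r}")

    def alternatives(self) -> List[Viewpoint]:
        """Viewpoints that were recorded but not adopted."""
        return [v for v in self.viewpoints if v is not self.agreed]


# Hypothetical example: two groups define "customer" differently.
customer = ConceptRecord("customer")
customer.add_viewpoint(
    Viewpoint("Sales", "anyone with an open quote", "pipeline reporting")
)
customer.add_viewpoint(
    Viewpoint("Finance", "anyone who has been invoiced", "revenue reporting")
)
customer.agree_on("Finance")
```

The point of the sketch is only that the viewpoints which were not adopted remain on record, so that future users and designers can see why the model is the way it is.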
Is the result valid?
The above discussion raises a question, though: what if two stakeholders disagree on a particular point?
When we participate in discussions we want our views to be taken seriously. Consequently, we present our views through statements that we hope others will see as being rational – i.e. based on sound premises and logical thought. One presumes that when someone makes a claim, he or she is willing to present arguments that will convince others of the reasonableness of the claim. Others will judge the claim based on the validity of the statements the claimant makes. When the validity claims are contested, debate ensues with the aim of getting to some kind of agreement.
The philosophy underlying such a process of discourse (which is simply another word for “debate” or “dialogue”) is described in the theory of communicative rationality proposed by the German philosopher Jürgen Habermas. The basic premise of communicative rationality is that rationality (or reason) is tied to social interactions and dialogue. In other words, the exercise of reason can occur only through dialogue. Such communication, or mutual deliberation, ought to result in a general agreement about the issues under discussion. Only once such agreement is achieved can there be a consensus on actions that need to be taken. Habermas refers to the latter as communicative action, i.e. action resulting from collective deliberation…
In brief: validity is not an objective matter but a subjective – or rather an intersubjective – one. That is, validity has to be agreed between all parties affected by the claim.
What is the social context in which the discipline operates?
From the above it should be clear that the alternate view of data management is radically different from the mainstream approach. The difference is particularly apparent when one looks at the way in which the alternate approach views different stakeholder groups. Recall that in the mainstream view, managerial perspectives take precedence over all others because the overriding aim of data modelling (as indeed of most enterprise IT activities) is control. Yes, I am aware that it is supposed to be about enablement, but the question is: enablement for whom? In most cases, the answer is that it enables managers to control. In contrast, from the above we see that the reality depicted in a data (or any other) model is socially constructed – that is, it is based on a consensus arising from debates on the spectrum of views that people hold. Moreover, no claim has precedence over others by virtue of authority. Different interpretations of the world have to be fused together in order to build a consensually accepted world.
The social aspect is further muddied by conflicts between managers on matters of data ownership, interpretation and access. Typically, however, such matters lie outside the purview of data modellers.
In brief: the social context in which the discipline operates is that there are a wide variety of stakeholder groups, each of which may hold different views. These must be debated and reconciled.
Summarising the alternate view
The detailed responses to the questions above reveal that the alternate view of data modelling is based on the following assumptions:
- The domain of interest is socially constructed.
- The standard representations of data models are inadequate because they cannot represent multiple viewpoints. They can (and should) be supplemented by notations that can document multiple viewpoints.
- A valid data model is constructed through an iterative synthesis of multiple viewpoints.
- The resulting model is based on a shared understanding of the socially constructed domain.
Clearly these assumptions are diametrically opposed to those in the mainstream. Let’s briefly discuss their implications for the profession.
The most important implication of the alternate view is that a data model is but one interpretation of reality. As such, there are many possible interpretations of reality and the “correctness” of any particular model hinges not on some objective truth but on an agreed, best-for-group interpretation. A consequence of the above is that well-constructed data models “fuse” or “bring together” at least two different interpretations – those of users and modellers. Typically there are many different groups of users, each with their own interpretation. This being the case, it is clear that the onus lies on modellers to reconcile any differences between these groups as they are the ones responsible for creating models.
A further implication of the above is that it is impossible to build a consistent enterprise-wide data model. That said, it is possible to have a high-level strategic data model that consists, say, of entities but lacks detailed attribute-level information. Such a model can be useful because it provides a starting point for a dialogue between user groups and also serves to remind modellers of the entities they may need to consider when building a detailed data model.
The mainstream view asserts that data is gathered to establish the truth. The alternate view, however, makes us aware that data models are built in such a way as to support particular agendas. Moreover, since the people who use the data are not those who collect or record it, a gap between assumed and actual meaning is inevitable. Once again this emphasises the fact that the meaning of a particular piece of data is very much in the eye of the beholder.
The mainstream approach to data modelling reflects the general belief that the methods of natural sciences can be successfully applied in the area of systems development. Although this is a good assumption for theoretical computer science, which deals with constructs such as data structures and algorithms, it is highly questionable when it comes to systems development in large organisations. In the latter case social factors dominate, and these tend to escape any logical system. This simple fact remains under-appreciated by the data modelling community and, for that matter, much of the universe of corporate IT.
The alternate view described in this post draws attention to the social and political aspects of data modelling. Although IT vendors and executives tend to give these issues short shrift, the chronically high failure rate of data-centric initiatives (such as those aimed at building enterprise data warehouses) warns us to pause and think. If there is anything at all to be learnt from these failures, it is that data modelling is not a science that deals with facts and models, but an art that has more to do with negotiation and law-making.