## Tackling the John Smith Problem – deduplicating data via fuzzy matching in R

Last week I attended a CRM & data user group meeting for not-for-profits (NFPs), organised by my friend Yael Wasserman from Mission Australia. Following a presentation from a vendor, we broke up into groups and discussed common data quality issues that NFPs (and, dare I say, most other organisations) face. Number one on the list was the vexing issue of duplicate constituent (donor) records – henceforth referred to as dupes. I like to call this the *John Smith Problem* because a typical customer database in a country with a large Anglo population is likely to have a fair number of records for customers with that name. The problem is tricky because one has to identify John Smiths who appear to be distinct in the database but are actually the same person, while also ensuring that one does not inadvertently merge two distinct John Smiths.

The John Smith problem is particularly acute for NFPs as much of their customer data comes in either via manual data entry or bulk loads with less than optimal validation. To be sure, all the NFPs represented at the meeting have some level of validation on both modes of entry, but all participants admitted that dupes tend to sneak in nonetheless…and at volumes that merit serious attention. Yael and his team have had some success in cracking the dupe problem using SQL-based matching of a combination of fields such as first name, last name and address or first name, last name and phone number and so on. However, as he pointed out, this method is limited because:

- It does not allow for typos and misspellings.
- Matching on too few fields runs the risk of false positives – i.e. labelling non-dupes as dupes.

The problems arise because SQL-based matching requires one to pre-specify match patterns. The solution is straightforward: use fuzzy matching instead. The idea behind fuzzy matching is simple: allow for inexact matches, assigning each match a *similarity score* ranging from 0 to 1 with 0 being complete dissimilarity and 1 being a perfect match. My primary objective in this article is to show how one can make headway with the John Smith problem using the fuzzy matching capabilities available in R.

### A bit about fuzzy matching

Before getting down to fuzzy matching, it is worth briefly describing how it works. The basic idea is simple: one has to generalise the notion of a match from a binary *“match” / “no match”* to allow for partial matching. To do this, we need to introduce the notion of an edit distance, which is essentially the *minimum number of operations required to transform one string into another*. For example, the edit distance between the strings *boy* and *bay* is 1: a single substitution (*o* → *a*) transforms one string into the other. The most commonly used edit distance is the Levenshtein distance: “*the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other*.”
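
To make this concrete, base R’s adist() function computes the Levenshtein distance directly:

```r
#base R's adist() computes the Levenshtein distance between strings
adist("boy", "bay")        #1: a single substitution (o -> a)
adist("kitten", "sitting") #3: two substitutions and one insertion
```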

A variant called the Damerau-Levenshtein distance, which additionally allows for the transposition of two adjacent characters (counted as one operation, not two), often proves more useful in practice. We’ll use a restricted version of it called the *optimal string alignment* (osa) distance. If you’re interested in finding out more about osa, check out the Damerau-Levenshtein article linked to earlier in this paragraph.
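
To see the difference a transposition makes, compare how plain Levenshtein and osa score the classic *i*/*e* swap (this uses the stringdist package, which we’ll load properly in a later section):

```r
library(stringdist)

#"recieve" is "receive" with the adjacent characters i and e transposed
stringdist("receive", "recieve", method = "lv")  #2: counted as two substitutions
stringdist("receive", "recieve", method = "osa") #1: counted as one transposition
```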

Since longer strings will potentially have larger numeric distances between them, it makes sense to normalise the distance to a value lying between 0 and 1. We’ll do this by dividing the calculated osa distance by the length of the longer of the two strings. Yes, this is crude but, as you will see, it works reasonably well. The resulting number is a normalised measure of the dissimilarity between the two strings. To get a similarity measure we simply subtract the dissimilarity from 1. So, a normalised dissimilarity of 1 translates to a similarity score of 0 – i.e. the strings are perfectly *dissimilar*. I hope I’m not belabouring the point; I just want to make sure it is perfectly clear before going on.
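
As a quick illustration of the normalisation (the two names below are made up for the example):

```r
library(stringdist)

a <- "John Smith"
b <- "Jhon Smyth"

#osa distance: one transposition (Jhon) plus one substitution (Smyth) = 2
d <- stringdist(a, b, method = "osa")

#normalise by the length of the longer string, then flip to a similarity
sim <- 1 - d/max(nchar(a), nchar(b))
sim #0.8
```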

### Preparation

In what follows, I assume you have R and RStudio installed. If not, you can download both from the official R and RStudio sites (installers are available for Windows and Macs); installation of both products is usually quite straightforward.

You may also want to download the Excel file **many_john_smiths**, which contains records for ten fictitious John Smiths. At this point I should affirm that, as far as the dataset is concerned, *any resemblance to actual John Smiths, living or dead, is purely coincidental!* Once you have downloaded the file, open it in Excel, examine the records, and save it as a csv file in your R working directory (or any other convenient place) for processing in R.

As an aside, if you have access to a database, you may also want to load the file into a table called **many_john_smiths** and run the following dupe-detecting SQL statement:

```sql
select *
from many_john_smiths t1
where exists (
  select 'x'
  from many_john_smiths t2
  where t1.FirstName = t2.FirstName
    and t1.LastName = t2.LastName
    and t1.AddressPostcode = t2.AddressPostcode
    and t1.CustomerID <> t2.CustomerID
)
```

You may also want to try matching on other column combinations, such as *First/Last Name and AddressLine1* or *First/Last Name and AddressSuburb*. The limitations of column-based exact matching will be evident immediately. Indeed, I have deliberately designed the records to highlight some of the issues associated with dirty data: misspellings, typos, misheard names over the phone etc. A quick perusal of the records will show that there are *probably* two distinct John Smiths in the list. The problem is to quantify this observation. We do that next.

### Tackling the John Smith problem using R

We’ll use the following libraries: stringdist, stringr and rockchalk. The first, stringdist, contains a number of string distance functions; we’ll use stringdistmatrix(), which returns a matrix of pairwise string distances (osa by default) when passed a vector of strings. From stringr, a collection of string utilities, we’ll use str_length(), which returns the length of a string. Finally, rockchalk contains a useful function, vech2mat(), which casts a half-vector into a triangular matrix (this will make more sense when you see it in action).

OK, so on to the code. The first step is to load the required libraries:

```r
#load libraries
library("stringdist")
library("stringr")
library("rockchalk")
```

We then read in the data, ensuring that we override the annoying default behaviour of R, which is to convert strings to categorical variables – we want strings to remain strings!

```r
#read data, taking care to ensure that strings remain strings
df <- read.csv("many_john_smiths.csv", stringsAsFactors = F)

#examine dataframe
str(df)
```

The output from str(df) (not shown) indicates that all columns barring *CustomerID* are indeed strings (i.e. type=character).

The next step is to find the length of each row:

```r
#find length of string formed by each row (excluding Title)
rowlen <- str_length(paste0(df$FirstName, df$LastName, df$AddressLine1,
                            df$AddressPostcode, df$AddressSuburb, df$Phone))

#examine row lengths
rowlen
```

```
[1] 41 43 39 42 28 41 42 42 42 43
```

Note that I have excluded the *Title* column as I did not think it was relevant to determining duplicates.

Next we find the distance between every pair of records in the dataset, using the stringdistmatrix() function mentioned earlier:

```r
#stringdistmatrix finds the pairwise osa distance between every pair of
#elements in a character vector
d <- stringdistmatrix(paste0(df$FirstName, df$LastName, df$AddressLine1,
                             df$AddressPostcode, df$AddressSuburb, df$Phone))
```

stringdistmatrix() returns an object of type dist (distance), so we’ll cast it into a matrix object for use later. We’ll also set the diagonal entries to 0 (since we know that the distance between a string and itself is zero):

```r
#cast d as a proper matrix
d_matrix <- as.matrix(d)

#set diagonals to 0 - distance between a string and itself is 0
diag(d_matrix) <- 0
```
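
To see what the cast does, here is the same pair of operations on a tiny character vector:

```r
library(stringdist)

v <- c("boy", "bay", "bat")
d_small <- stringdistmatrix(v) #a 'dist' object: just the 3 pairwise distances
as.matrix(d_small)             #the full, symmetric 3x3 distance matrix
```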

For reasons that will become clear later, it is convenient to normalise the distance – i.e. scale it to a number that lies between 0 and 1. We’ll do this by dividing the distance between two strings by the length of the longer string. We’ll use the nifty base R function combn() to compute the maximum length for every pair of strings:

```r
#find the length of the longer of the two strings in each pair
pwmax <- combn(rowlen, 2, max, simplify = T)
```

The first argument is the vector from which combinations are to be generated, the second is the group size (2, since we want pairs), the third is the function to apply to each combination (max), and the fourth, simplify, indicates whether the result should be returned as an array (simplify=T) or a list (simplify=F). The returned object, pwmax, is a one-dimensional array containing the pairwise maximum lengths. We convert this to a proper matrix using the vech2mat() function from rockchalk.
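
A tiny example shows the order in which combn() generates pairs:

```r
#combn() applies max to each pair, in the order (1,2), (1,3), (2,3)
lens <- c(4, 7, 5)
combn(lens, 2, max, simplify = T) #7 5 7
```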

```r
#convert the resulting vector to a matrix; the diagonal entries should be the
#respective row lengths
pmax_matrix <- vech2mat(pwmax, diag = rowlen)
```
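
If you don’t have rockchalk handy, the same matrix can be built in base R. The sketch below assumes only that combn() emits pairs in the same column-major order in which lower.tri() fills a matrix (it does); vech2mat_base is a made-up name, not part of any package:

```r
#fill the strictly lower triangle column by column - the same order in which
#combn() generates pairs - then symmetrise and add the diagonal
vech2mat_base <- function(vech, diag) {
  n <- length(diag)
  m <- matrix(0, n, n)
  m[lower.tri(m)] <- vech #column-major fill matches combn() pair order
  m <- m + t(m)
  diag(m) <- diag
  m
}

#example: row lengths 3, 5, 4 give pairwise maxima 5, 4, 5
vech2mat_base(c(5, 4, 5), diag = c(3, 5, 4))
```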

Now we divide the distance matrix by the maximum length matrix in order to obtain a normalised distance:

```r
#normalise distance
norm_dist_matrix <- d_matrix/pmax_matrix
```

The normalised distance lies between 0 and 1 (check this!) so we can define similarity as 1 minus distance:

```r
#similarity = 1 - distance
sim_matrix <- round(1 - norm_dist_matrix, 2)
sim_matrix
```

```
      1    2    3    4    5    6    7    8    9   10
1  1.00 0.84 0.76 0.64 0.54 0.46 0.52 0.76 0.55 0.60
2  0.84 1.00 0.70 0.51 0.40 0.51 0.47 0.70 0.49 0.49
3  0.76 0.70 1.00 0.43 0.33 0.32 0.38 0.60 0.55 0.42
4  0.64 0.51 0.43 1.00 0.64 0.71 0.79 0.52 0.50 0.70
5  0.54 0.40 0.33 0.64 1.00 0.56 0.50 0.45 0.43 0.49
6  0.46 0.51 0.32 0.71 0.56 1.00 0.67 0.40 0.31 0.56
7  0.52 0.47 0.38 0.79 0.50 0.67 1.00 0.48 0.45 0.63
8  0.76 0.70 0.60 0.52 0.45 0.40 0.48 1.00 0.48 0.49
9  0.55 0.49 0.55 0.50 0.43 0.31 0.45 0.48 1.00 0.44
10 0.60 0.49 0.42 0.70 0.49 0.56 0.63 0.49 0.44 1.00
```

```r
#write out the similarity matrix
write.csv(sim_matrix, file = "similarity_matrix.csv")
```

The similarity matrix looks quite reasonable: you can, for example, see that records 1 and 2 (similarity score=0.84) are quite similar while records 1 and 6 are quite dissimilar (similarity score=0.46). Now let’s extract some results more systematically. We’ll do this by printing out the top 5 non-diagonal similarity scores and the associated records for each of them. This needs a bit of work. To start with, we note that the similarity matrix (like the distance matrix) is symmetric so we’ll convert it into an upper triangular matrix to avoid double counting. We’ll also set the diagonal entries to 0 to avoid comparing a record with itself:

```r
#convert to upper triangular to prevent double counting
sim_matrix[lower.tri(sim_matrix)] <- 0

#set diagonals to zero to avoid comparing a row to itself
diag(sim_matrix) <- 0
```

Next we create a function that returns the n largest similarity scores and their associated row and column number – we’ll need the latter to identify the pair of records that are associated with each score:

```r
#adapted from:
#https://stackoverflow.com/questions/32544566/find-the-largest-values-on-a-matrix-in-r
nlargest <- function(m, n) {
  res <- order(m, decreasing = T)[seq_len(n)]
  pos <- arrayInd(res, dim(m), useNames = TRUE)
  list(values = m[res],
       position = pos)
}
```

The function takes two arguments: a matrix m and a number n indicating the top n scores to be returned. Let’s set this number to 5 – i.e. we want the top 5 scores and the associated record indexes. We’ll store the output of nlargest in the variable sim_list:

```r
top_n <- 5
sim_list <- nlargest(sim_matrix, top_n)
```
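
To see what nlargest() returns, here it is applied to a small matrix (the function is repeated so the snippet stands alone):

```r
nlargest <- function(m, n) {
  res <- order(m, decreasing = T)[seq_len(n)]
  pos <- arrayInd(res, dim(m), useNames = TRUE)
  list(values = m[res], position = pos)
}

#upper-triangular toy matrix: the two largest entries are 9 (row 2, col 3)
#and 5 (row 1, col 2)
m <- matrix(c(0, 5, 2,
              0, 0, 9,
              0, 0, 0), nrow = 3, byrow = TRUE)
nlargest(m, 2)
```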

Finally, we loop through sim_list printing out the scores and associated records as we go along:

```r
#loop through the top scores, printing each score and its associated records;
#sim_list$position is an n x 2 matrix, so position[i] is the row index and
#position[i+top_n] the column index of the i-th score
for (i in 1:top_n){
  rec <- as.character(df[sim_list$position[i],])
  sim_rec <- as.character(df[sim_list$position[i+top_n],])
  cat("score: ", sim_list$values[i], "\n")
  cat("record 1: ", rec, "\n")
  cat("record 2: ", sim_rec, "\n\n")
}
```

```
score: 0.84
record 1: 1 John Smith Mr 12 Acadia Rd Burnton 9671 1234 5678
record 2: 2 Jhon Smith Mr 12 Arcadia Road Bernton 967 1233 5678

score: 0.79
record 1: 4 John Smith Mr 13 Kynaston Rd Burnton 9671 34561234
record 2: 7 Jon Smith Mr. 13 Kinaston Rd Barnston 9761 36451223

score: 0.76
record 1: 1 John Smith Mr 12 Acadia Rd Burnton 9671 1234 5678
record 2: 3 J Smith Mr. 12 Acadia Ave Burnton 867`1 1233 567

score: 0.76
record 1: 1 John Smith Mr 12 Acadia Rd Burnton 9671 1234 5678
record 2: 8 John Smith Dr 12 Aracadia St Brenton 9761 12345666

score: 0.71
record 1: 4 John Smith Mr 13 Kynaston Rd Burnton 9671 34561234
record 2: 6 John S Dr. 12 Kinaston Road Bernton 9677 34561223
```

As you can see, the method correctly identifies close matches: there appear to be two distinct John Smiths (records 1 and 4) – and possibly more, depending on where one sets the similarity threshold. I’ll leave you to explore this further on your own.
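
One simple way to explore this is to flag every pair of records above a chosen similarity threshold. The snippet below uses a toy upper-triangular similarity matrix and an arbitrary 0.7 cutoff, both illustrative rather than prescriptive:

```r
#toy upper-triangular similarity matrix for three records
sim <- matrix(c(0, 0.84, 0.30,
                0, 0,    0.76,
                0, 0,    0), nrow = 3, byrow = TRUE)

#flag all pairs whose similarity exceeds the threshold
threshold <- 0.7
which(sim > threshold, arr.ind = TRUE) #pairs (1,2) and (2,3)
```

On the article’s sim_matrix (after the upper-triangular conversion above), the same which() call lists candidate dupe pairs directly.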

### The John Smith problem in real life

As a proof of concept, I ran the following SQL on a real CRM database hosted on SQL Server:

```sql
select
  FirstName + LastName,
  count(*)
from
  TableName
group by
  FirstName + LastName
having
  count(*) > 100
order by
  count(*) desc
```

I was gratified to note that John Smith did indeed come up tops – well over 200 records. I suspected there were a few duplicates lurking within, so I extracted the records and ran the above R code (with a few minor changes). I found there indeed were some duplicates! I also observed that the code ran with no noticeable degradation despite the dataset having well over 10 times the number of records used in the toy example above. I have not run it for larger datasets yet, but I suspect one will run into memory issues when the number of records gets into the thousands, since the number of pairwise distances grows quadratically with the number of records. Nevertheless, based on my experimentation thus far, this method appears viable for small datasets.

The problem of deduplicating large datasets is left as an exercise for motivated readers 😛

### Wrapping up

Often organisations will turn to specialist consultancies to fix data quality issues only to find that their work, besides being quite pricey, comes with a lot of caveats and cosmetic fixes that do not address the problem fully. Given this, there is a case to be made for doing as much of the exploratory groundwork as one can so that one gets a good idea of what can be done and what cannot. At the very least, one will then be able to keep one’s consultants on their toes. In my experience, the John Smith problem ranks right up there in the list of data quality issues that NFPs and many other organisations face. This article is intended as a starting point to address this issue using an easily available and cost effective technology.

Finally, I should reiterate that the approach discussed here is just one of many possible and is neither optimal nor efficient. Nevertheless, it works quite well on small datasets, and is therefore offered here as a starting point for your own attempts at tackling the problem. If you come up with something better – as I am sure you can – I’d greatly appreciate your letting me know via the contact page on this blog or better yet, a comment.

**Acknowledgements:**

I’m indebted to Homan Zhao and Sree Acharath for helpful conversations on fuzzy matching. I’m also grateful to all those who attended the NFP CRM and Data User Group meetup that was held earlier this month – the discussions at that meeting inspired this piece.

### Comments

Thanks Kailash – great post. The edit-distance approach is very elegant. Compute constraints notwithstanding, I can see how this would work well when working with a single field set, e.g. name+address. I wonder how it would go when we introduce email and multiple phone number fields, some of which may be blank for some records. For example, what score might we expect for the following?

Jon Smith Mr 13 Kynaston Rd Burnton johnsmith@example.com Mobile: 0425 555 555

John Smith Mr. 13 Kinaston Rd Barnston Home Phone: 9761 36451223 Work Phone 9761 5555555

I imagine it would involve running multiple matrices using different field sets e.g., Test 1: Name, Address; Test 2: Name, Email and so on, then take the highest score for each pairing. So Test 1 might produce 0.95, Test 2 0.75 so 0.95 becomes the score.

You’ve certainly left me with something to think about.


**Yael Wasserman, October 10, 2019 at 9:25 am**

Hi Yael,

Many thanks for taking the time to read and for your feedback.

Yep, the optimal choice of fields is a tricky one. In the POC involving 200+ John Smiths (described later in the post) I chose First Name, Last Name, AddressLine1, Suburb, Postcode, EmailAddress and Phone as these are the most commonly used ones for matching. The results were much better than I expected in that I was actually able to identify duplicate records quite accurately. That said, I think the optimal choice of fields would involve a fair bit of trial and error so your suggestion of comparing outputs from different field combinations is an excellent one, and definitely worth testing.

Thanks again!

Regards,

K.


**K, October 10, 2019 at 10:06 am**