Just about every organization I've worked with has had some sort of data quality problem, most notably the presence of duplicate customer records. I think we can all agree that duplicate customer records can cause all sorts of problems ranging from inconsistent customer experience to inaccurate reporting and analytics.
Unfortunately many think they aren't equipped to deal with this issue. They feel that in order to get their customer data clean they are going to need to implement some complicated and expensive piece of software like an MDM system (yuck) or a CDP (meh). Not only do these software packages take a long time to select and implement, the customer matching capabilities are often not that impressive.
Let’s end this paralysis, roll up our sleeves and clean up our customer data with the tools we have. My grandfather's motto was “it’s better to do something than nothing”, and to me this definitely resonates with this problem. All you need is a good old database, preferably an MPP like Redshift, and a tiny bit of Python. I will share an approach I've used in multiple organizations, in most cases exceeding the results from the aforementioned systems. Although I’ve potentially thrown some shade on this approach by calling it quick and dirty, this is a perfectly reasonable, and productionizable system for large organizations as well as small and scrappy ones.
Standardization and Matching
The first steps in our data cleanup is to standardize our data and then essentially perform a self join to match our results together. The theme of this article is quick and dirty, so we are going to do only very light cleanup. After we walk through this simple but effective solution I’ll provide some tips on how to make this quite sophisticated.
In the first CTE below I do some minor cleanup. I parse the first segment of the zip and strip non numeric phone numbers. Based on my understanding of the data I may choose to do more operations such as trimming strings or padding numbers. I also assembled a concatenated field phone_list which will allow me to compare phone numbers across a number of fields. I could assemble a similar field if I had multiple emails or addresses.
with cust as (
select
customernumber,
firstname,
lastname,
email,
regexp_replace(mobile, '[^0-9]','') mobile,
regexp_replace(phone1, '[^0-9]','') phone1,
regexp_replace(phone2, '[^0-9]','') phone2,
regexp_replace(phone3, '[^0-9]','') phone3,
regexp_replace(phone4, '[^0-9]','') phone4,
nvl(regexp_replace(mobile, '[^0-9]',''), '') || '|'
|| nvl(regexp_replace(phone1, '[^0-9]',''), '') || '|'
|| nvl(regexp_replace(phone2, '[^0-9]',''), '') || '|'
|| nvl(regexp_replace(phone3, '[^0-9]',''), '') || '|'
|| nvl(regexp_replace(phone4, '[^0-9]',''), '') as phone_list,
address1,
substring(zip, 0, charindex('-', zip) - 1) as zip
from customer A )
I then perform the match. Keeping it simple, I require an exact match on the name. I then OR together a number of conditions such as email, phone, or address. I use charindex
to test the existence of a phone number in my concatenated phone_list
field. I also use a left(a.address1,5)
AND zip which might seem strange but in practice I've found very effective.
select
a.*,
b.customernumber as matched_customernumber
from cust a
join cust b on a.firstname = b.firstname
and a.lastname = b .lastname
and a.customernumber <> b.customernumber -- we don't want to join the same record to itself
and (
a.email = b.email
or charindex(a.mobile, b.phone_list) > 0
or charindex(a.phone1, b.phone_list) > 0
or charindex(a.phone2, b.phone_list) > 0
or charindex(a.phone3, b.phone_list) > 0
or charindex(a.phone4, b.phone_list) > 0
or (left(a.address1,5) = left(b.address1,5) and a.zip = b.zip)
)
Enhancing The Process
You can put as much effort into standardization and matching as you have time to invest. This effort will afford you a higher match rate (reducing undermatching/more dupes eliminated). However at a certain point, especially if you make things too fuzzy you can result in overmatching (false positives/potentially merging unique people). Given time, you have the opportunity to easily rival commercial tooling, just use your time wisely:
- Parsing addresses - standardize and break addresses into parts. You will get a more deterministic join based on street and house number vs a string match
- Resolve names to root forms - there are many available lists and csv files out there that will resolve common abbreviations and nicknames such as the infamous Robert = Bob
- Fuzzy matching - transforming field values to fuzzy representations or using fuzzy matching techniques such as Levenshtein Distance
- Cleaning bad values - null out addresses and emails in your match set which appear in high frequency or match known bad values (such as a store location)
Much of the above can be done in SQL, however the beauty of platforms like Amazon Redshift is they can be extended with Python. Here is a great article on packaging and leveraging the Python module fuzzywuzzy within Redshift.
Clustering and the “Graph Trick”
I bet a few of you have tried something similar to what i’ve outlined above and then faced a problem - how the heck do you cluster these matches and establish a survivor. If this term “survivor” doesn’t immediately register , it essentially represents collapsing a bunch of duplicate records into a single “golden record” which will remain, and all other matched records removed. Let’s look at a quick example to prove why this is both important as well as a difficult task.
For simplicity, I’m only showing half of the rows in the result, as you will have the same match in a reverse relationship. As you can see there is no natural hierarchy, and that there are transitive and multiple matches across the dataset. The first record matches the second, the second matches the third and so on. There is no way to easily move forward here without being able to cluster the results. In SQL terms we want all matches group-able by a partition by
clause.
You might be tempted to try a recursive CTE, however since this operation is designed for hierarchies and does not tolerate loops, your query will likely run infinitely and time out. Applying limits is not a good option either as you have no control over where the query will terminate and may have incomplete results.
This is where remodeling this problem as a graph can really simplify the problem. As you can see the picture becomes a lot clearer when modeled as nodes and edges instead of rows. And it’s not just simple for us, it’s also simpler from a programming model perspective.
We can now load this data into a graph model and iterate the subgraphs. The subgraphs being the small disconnected graphs within the larger graph, and in this case just so happen to be our clusters.
import networkx as nx
Graphtype = nx.Graph()
with open('result.csv', "r") as data:
# skip the header
next(data, None)
# parse into the graph
G = nx.parse_edgelist(data, delimiter=',', create_using=Graphtype)
# append the individual subgraph nodes to a list
results = []
for x in (G.subgraph(c) for c in nx.connected_components(G)):
results.append(x.nodes())
parent = ''
with open('output.csv', "w") as output:
for group in results:
parent = list(group)[0]
for match in group:
output.write(','.join([parent, match]) + "\n")
In this example I assign the first customer in the flattened subgraph as survivor (list(group)[0])
, but you could obviously apply any business logic that is appropriate. Perhaps picking the lowest number, the oldest record, or maybe even importing a dataset of customer spend and choosing the customer record with the highest value or oldest/latest transaction date. But if your trying to keep it simple, and not that familiar with Python or graphs you can use the above script as is, and resolve the rest in SQL
Once you have created this dataset, you can reload it into your database and use it to create your golden record with relative ease. You can now leverage windowed functions, coalesce, and other familiar SQL operations since we can now partition by
the survivor's customer number.
Quick and Dirty?
Quick and dirty is a good attention getter, however I actually consider this to be a pretty darn robust solution for building both one-time and recurring customer deduplication solutions. I’d go so far as to say it represents using the right tool for the right job. Matching and final record assembly is easily expressed in SQL, and accessible for most data wranglers to tune and improve. Likewise clustering and survivorship almost perfectly fits the graph model. As a side note, I’d encourage you to keep graphs in mind when solving other types of problems as Graphs Are Everywhere..
One other thing I wanted to mention - this solution performs!. On a customer dataset of about 1 million records, Redshift crushed the match in under 5 seconds, and the graph clustering job in just about 2 seconds! I've personally waited hours or even days for commercial tooling to process a full match on customer datasets of similar scale.
I hope you enjoyed this solution and are encouraged to go squash some duplicates!