Deduplicate Salesforce Contact Email Addresses with Python

Katie - Nov 23 '18 - - Dev Community

Salesforce #AwesomeAdmin Matthew (@matthewsalibi) asked how he could use Python to deduplicate Salesforce NPSP email addresses across several fields. Today I’ll provide you a script you can copy, paste, and try on your own data. And for the geeks, I’ll break it down and explain how every line works.

Data & Expectations

Input Data

As a recap, Matthew proposed that we start with a CSV-formatted export of his “Contacts” table, “ contacts.csv ,” that looked something like this (I consolidated his examples and added a bit of complexity):

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com ca@hotmail.com
003…02 DDDD Eleanor Shellstrop Personal es@gp.com
003…03 EEEE Jason Mendoza Work jm@gp.com jm@gp.com jm@gp.com
003…04 FFFF Eleanor Shellstrop Alternate es@hotmail.com es@gp.com
003…05 GGGG Janet Della-Denunzio Alternate jd@gp.com jm@gp.com
003…06 HHHH Tahani Al-Jamil Alternate ta@gp.com ta@yahoo.com ta@gp.com
003…07 IIII Chidi Anagonye Work ca@gp.com
003…08 JJJJ Mindy St. Claire Personal ms@mp.com
003…09 KKKK Kamilah Al-Jamil Personal ka@kafoundation.org
003…10 AAAA Eleanor Shellstrop Alternate es@gp.com
003…11 CCCC Ellie Shellstrop Work es@gp.com
003…12 LLLL Mindy St. Claire Personal ms@yahoo.com ms@hotmail.com ms@hotmail.com
003…13 MMMM Mindy St. Claire Work ms@z.com ms@z.com ms@z.com
003…14 NNNN Tahani Al-Jamil Alternate ta@gp.com

Output File #1

Matthew wanted to see an “ output1.csv ” that would identify duplicates within rows and provide a “cleanup” file, ready for Data Loader, that looks like this:

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com
003…03 EEEE Jason Mendoza Work jm@gp.com
003…06 HHHH Tahani Al-Jamil Alternate ta@yahoo.com ta@gp.com
003…13 MMMM Mindy St. Claire Work ms@z.com

Output File #2A

To identify entire rows suspected of being redundant Contacts, Matthew proposed a “ready for DemandTools” fileoutput2a.csv ” that would look like the data below.

This data is compatible with the “Select Master / Non-Master records from file” functionality of DemandTools’s “Single Table Dedupe” module.

Each row indicates a “same email address, same first name” pairing that was found in multiple Contact records (where the “email address” could have appeared in any of the “Email” fields that come with the NPSP).

Each column is the Salesforce ID of a Contact in which the email address was found.

Master Duplicate_1 Duplicate_2
003…01 003…07
003…10 003…02 003…04
003…06 003…14

Output File #2B

For sets of Contacts that shared an email address but not a first name, Matthew wanted to see a bit more detail in the output file “ output2b.csv ”, to facilitate human review. He suggested it look like the output below.

emAddr Master Master_FN Master_LN Duplicate_1 Duplicate_1_FN Duplicate_1_LN
es@gp.com 003…10 Eleanor Shellstrop 003…11 Ellie Shellstrop
jm@gp.com 003…03 Jason Mendoza 003…05 Janet Della-Denunzio

Running The Code Yourself

Unfortunately, I haven’t yet made a blog post about running Python code on your own computer, so if you’re totally new to Python, you’re a bit stuck – but show this code to a colleague who does know how!

Or, if you’re adventurous , download an “Anaconda suite” installer for your computer’s operating system and give it a try.

  • Don’t bother installing VSCode when it asks if you want to.
  • Once it’s done, bring up the “Anaconda Navigator” and open “Spyder.”
    • Type print('hello') in the code-editing panel at the left, hit “run” near the top (if a box pops up, hit “cancel” and do it again), and then check to see if hello is printed at the right-hand side of your screen in an output console (probably bottom right).
    • If so, congrats – you just wrote a program! Backspace out print(hello) and copy/paste my code in instead, then start tweaking it.

And stay tuned for my “office hours” and “hands-on trainings,” where I teach total non-programmers to run code like this.


The Code

There’s no shame in simply copying this code, pasting it into your Python “IDE,” and making a few tweaks!

You will, of course, have to do a bit of “Find & Replace” to fix up column-name references (which, by the way, are case-sensitive to match their capitalization in your input CSV file ), if your columns aren’t precisely spelled “PersonalEmail,” “WorkEmail,” “Human-Readable ID,” etc.

Code to produce “Output1” only

import time
start_time = time.time()
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output1FilePath = r'c:\example\output1.csv' # Edit this before running
output1NoneFoundFilePath = r'c:\example\nonefound1.txt' # Edit this before running

df = pandas.read_csv(inputFilePath, dtype=object)

df['ChangedAnything'] = False

validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

for fieldName in emFieldNames:
    needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

output1df = df[df['ChangedAnything']]

output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])

if len(output1df) > 0:
    output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
    with open(output1NoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')
Enter fullscreen mode Exit fullscreen mode

Code to produce “Output2A” and “Output2B” only

import time
start_time = time.time()
from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output2aFilePath = r'c:\example\output2a.csv' # Edit this before running
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt' # Edit this before running
output2bFilePath = r'c:\example\output2b.csv' # Edit this before running
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt' # Edit this before running

df = pandas.read_csv(inputFilePath, dtype=object)

df = df.sort_values('Human-Readable ID')
df = df.reset_index(drop=True)
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])

allEmailsSeries = df[emFieldNames].stack()

groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()

emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}

output2aList = []
output2bList = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
    seenFirstNamesAndTheirContacts = {}
    numCellsFound = len(foundCells)
    output2bFact = OrderedDict()
    if numCellsFound > 1:
        for cellLoopCounter, cell in enumerate(foundCells):
            if cell[2] not in seenFirstNamesAndTheirContacts:
                seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
                if cellLoopCounter == 0:
                    output2bFact['emAddr'] = emAddr
                    output2bFact['Master'] = cell[1]
                    output2bFact['Master_FN'] = cell[2]
                    output2bFact['Master_LN'] = cell[3]
                else:
                    uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)] = cell[1]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_FN'] = cell[2]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_LN'] = cell[3]
            elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
                seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
            if cellLoopCounter == numCellsFound-1:
                if 'Duplicate_1' in output2bFact:
                    output2bList.append(output2bFact)
                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
                    output2aFact = OrderedDict()
                    if len(contactsForFN) > 1:
                        for contactLoopCounter, contact in enumerate(contactsForFN):
                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]
                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
                    if 'Duplicate_1' in output2aFact:
                        output2aList.append(output2aFact)

if len(output2aList) > 0:
    output2adf = pandas.DataFrame(output2aList)
    output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
    with open(output2aNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

if len(output2bList) > 0:
    output2bdf = pandas.DataFrame(output2bList)
    output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
    with open(output2bNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')
Enter fullscreen mode Exit fullscreen mode

Code to produce all 3 output files at once

import time
start_time = time.time()
from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output1FilePath = r'c:\example\output1.csv' # Edit this before running
output1NoneFoundFilePath = r'c:\example\nonefound1.txt' # Edit this before running
output2aFilePath = r'c:\example\output2a.csv' # Edit this before running
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt' # Edit this before running
output2bFilePath = r'c:\example\output2b.csv' # Edit this before running
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt' # Edit this before running

df = pandas.read_csv(inputFilePath, dtype=object)

# ### Output1 portion starts here ###
df['ChangedAnything'] = False

validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

for fieldName in emFieldNames:
    needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

output1df = df[df['ChangedAnything']]

output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])

if len(output1df) > 0:
    output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
    with open(output1NoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
# ### Output1 portion ends here ###

# ### Output2 portion starts here ###
df = df.sort_values('Human-Readable ID')
df = df.reset_index(drop=True)
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])

allEmailsSeries = df[emFieldNames].stack()

groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()

emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}

output2aList = []
output2bList = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
    seenFirstNamesAndTheirContacts = {}
    numCellsFound = len(foundCells)
    output2bFact = OrderedDict()
    if numCellsFound > 1:
        for cellLoopCounter, cell in enumerate(foundCells):
            if cell[2] not in seenFirstNamesAndTheirContacts:
                seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
                if cellLoopCounter == 0:
                    output2bFact['emAddr'] = emAddr
                    output2bFact['Master'] = cell[1]
                    output2bFact['Master_FN'] = cell[2]
                    output2bFact['Master_LN'] = cell[3]
                else:
                    uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)] = cell[1]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_FN'] = cell[2]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_LN'] = cell[3]
            elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
                seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
            if cellLoopCounter == numCellsFound-1:
                if 'Duplicate_1' in output2bFact:
                    output2bList.append(output2bFact)
                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
                    output2aFact = OrderedDict()
                    if len(contactsForFN) > 1:
                        for contactLoopCounter, contact in enumerate(contactsForFN):
                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]
                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
                    if 'Duplicate_1' in output2aFact:
                        output2aList.append(output2aFact)

if len(output2aList) > 0:
    output2adf = pandas.DataFrame(output2aList)
    output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
    with open(output2aNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

if len(output2bList) > 0:
    output2bdf = pandas.DataFrame(output2bList)
    output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
    with open(output2bNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
# ### Output2 portion ends here ###

print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')
Enter fullscreen mode Exit fullscreen mode

Stop reading here if you don’t want to learn Python

You don’t have to thoroughly understand the code above to make use of it.

Just be sure that your “output” files are different filenames than your “input” file and you can’t do too much damage.

Worst-case scenario, you’ll freeze your computer running the program against too large of a file – but several million rows is nothing (a minute or few), so get up and stretch. It’s good for you!

For the curious, the rest of this post is total geekery peeking under the hood.


Code Explanations

I'm struggling to get all the formatting to show up correctly on DEV, so check out my breakdown at the original blog post and come back here with questions, comments, or corrections. I look forward to hearing from you!

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .