Search code examples
pythonexcelpython-3.xopenpyxlbreak

How to eliminate duplicates from an Excel file trough openpyxl


I have two excel columns, name and company, and I'm trying to figure out what's the easiest way to make sure to obtain as an output a list of tuples (name, company) without any duplicates

The following code seems correct to me but for some reason it's not working, probably because of some dumb error I can't seem to find.

updated = openpyxl.load_workbook('abc.xlsx')
u_wb = updated.get_sheet_by_name('SP_Table')
u_names = u_wb['F'] #column F is where the names are
u_company = u_wb['C'] #column C is where the company's name are
l=[]

for x in range(len(u_names)-1):
    i=x
    i+=1
    if u_company[x].value==None #in case a field is missing
        continue
    if i==len(u_names):
        break
    for z in l:
        r=(u_names[x].value, u_names[x].value)
        if r==z:
            continue
    else:
        t=(u_names[x].value, u_company[x].value)
        l.append(t)
print("Number of contacts:", len(l))

I don't get any error and the contact's count is actually reduced but only because of the u_company[x].value==None clause. Any help or resource is appreciated


Solution

  • The condition on which you're trying to ignore duplicates is not correct.

    You are adding to the list couples of (u_names[x].value, u_company[x].value). That is OK and makes sense. The problem is that you're checking if (u_names[x].value, u_names[x].value) is already in the list.

    Apart from that, even if it was the same, you're simply doing nothing when you find a duplicate. Your else statement after the for will always execute! This is because else statement after a for loop happens when the loop came to conclusion without hitting a break statement. So, what you want to do is:

    for x in range(len(u_names)):
        if u_company[x].value==None #in case a field is missing
            continue
    
        r = (u_names[x].value, u_company[x].value)
        if r in l:
            continue
        else:
            l.append(t)
    
    print("Number of contacts:", len(l))