Tags: python, list, subtraction

How to identify duplicate strings in lists?


I am importing a column of 1280 (so I thought) unique IDs from a CSV file into a DataFrame.

I had planned to put every ID into a dictionary as a key with '0' as the value, and then put everything into a new DataFrame.

When building the dictionary from the column, I noticed that the number of entries dropped to 1189 instead of 1280. Extracting the column as a list still gives 1280.

I figured there must be duplicates in the original DataFrame. That would be a surprise, since the IDs are supposed to be unique. I could take a shortcut and just use the list for the new DataFrame. However, it is vital that I figure out what's going on and identify the duplicates if there are any.

The only problem is, I can't identify any duplicates. I'm at a loss as to what the problem could be.

import pandas as pd
from itertools import cycle

DF0 = pd.read_csv("FILENAME.csv", sep='$', encoding='utf-8-sig')

l_o_0 = ['0']

l_DF0 = list(DF0['Short_ID'])
print('  len of origin object   '+str(len(DF0['Short_ID'])))
print('            l_DF0 is a   '+str(type(l_DF0)))
print('                of len   '+str(len(l_DF0))+'\n')

d_DF0 = dict(zip(DF0['Short_ID'], cycle(l_o_0)))
print('  len of origin object   '+str(len(DF0['Short_ID'])))
print('            d_DF0 is a   '+str(type(d_DF0)))
print('                of len   '+str(len(d_DF0))+'\n')

print('           difference:   '+(str(len(DF0['Short_ID'])-len(d_DF0)))+'\n')

s_DF0 = set(l_DF0)
print('            s_DF0 is a   '+str(type(s_DF0)))
print('             of length   '+str(len(s_DF0))+'\n')

red_l_DF0 = list(s_DF0)
print('        red_l_DF0 is a   '+str(type(red_l_DF0)))
print('             of length   '+str(len(red_l_DF0))+'\n')

l_prob = []
for item in l_DF0:
    if item not in red_l_DF0:
        l_prob.append(item)
print('           l_prob is a   '+str(type(l_prob)))
print('             of length   '+str(len(l_prob))+'\n')

The output is:

  len of origin object   1280
            l_DF0 is a   <class 'list'>
                of len   1280

  len of origin object   1280
            d_DF0 is a   <class 'dict'>
                of len   1189

           difference:   91

            s_DF0 is a   <class 'set'>
             of length   1189

        red_l_DF0 is a   <class 'list'>
             of length   1189

           l_prob is a   <class 'list'>
             of length   0
>>>

I tried the above based on what I found in the question "Python list subtraction operation". Either I'm not using the tool properly or it's the wrong tool. Any help would be appreciated. Thanks in advance!


Solution

  • Use pandas' duplicated method on the column:

    duplicated_stuff = DF0[DF0['Short_ID'].duplicated()]
    

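    Depending on what you want to see, change the keep parameter of duplicated. The default, keep='first', flags every occurrence except the first one. For your debugging you probably want keep=False, which flags every row whose ID occurs more than once.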
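
As for why l_prob stays empty in the question: red_l_DF0 is built from the same values as l_DF0, so the test "item not in red_l_DF0" can never be true, duplicates or not. Here is a minimal sketch of how the keep parameter behaves. The six sample IDs are made up for illustration; in the real case DF0 would come from the CSV file as in the question.

```python
import pandas as pd

# Hypothetical stand-in for the 'Short_ID' column (the real data comes from the CSV).
DF0 = pd.DataFrame({"Short_ID": ["A1", "B2", "A1", "C3", "B2", "A1"]})

# keep='first' (the default): flags every occurrence except the first one.
later_copies = DF0[DF0["Short_ID"].duplicated()]

# keep=False: flags every row whose ID occurs more than once.
all_copies = DF0[DF0["Short_ID"].duplicated(keep=False)]

# How often each duplicated ID occurs.
counts = DF0["Short_ID"].value_counts()
dup_counts = counts[counts > 1]

# Number of rows that disappear when the IDs are used as dict keys or put in a set.
lost_rows = len(DF0) - DF0["Short_ID"].nunique()

print(all_copies.sort_values("Short_ID"))
print(dup_counts)
print(lost_rows)
```

With the default keep, the number of flagged rows should equal the difference between the column length and the dict length (91 in the question), which is a quick sanity check that the duplicates explain the whole gap.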