Search code examples
pythonpandasopencsv

Output a list of duplicate values ID's and non duplicate values using Python


I have a huge dataset held in a csv file which holds 2 columns in excel of data which I need to print out the row ID of which ones have the same results (regardless of the order of numbers) and another list of results that are not the same. Some cells have more than 2 integers in them. I am trying to use Python to do this, I've also read around and Panda seems to be the way forwards. Please note I am new to Python and cannot seem to find a simple script to do this (only ones that delete duplicates).

So for example the required output would give me

Duplicated Row ID's: 1, 3.

Unique Row ID's: 2, 4.

 A        B            C
ID      Col1         Col2
-----------------------------
1    ["1", "3"]   ["3", "1"]        
2    ["1", "2"]   ["9", "2"]
3    ["8", "0"]   ["0", "8"]
4    ["3", "7"]   ["1", "3"]

Solution

  • First if necessary convert string repr to lists:

    import ast
    
    df['ColA'] = df['ColA'].apply(ast.literal_eval)
    df['ColB'] = df['ColB'].apply(ast.literal_eval)
    

    Then sorting both columns lists and compare by Series.eq, last filter by DataFrame.loc with boolean indexing, ~ is for invert mask:

    m = df['ColA'].apply(sorted).eq(df['ColB'].apply(sorted))
    dup = df.loc[m, 'ID']
    print (dup)
    0    1
    2    3
    Name: ID, dtype: int64
    
    un = df.loc[~m, 'ID']
    print (un)
    1    2
    3    4
    Name: ID, dtype: int64
    

    For improve performance is possible convert lists to DataFrames and compare, only necessary same lengths of lists in both columns and in all rows:

    df1 = pd.DataFrame(np.sort(df['ColA'].tolist(), axis=1), index=df.index)
    df2 = pd.DataFrame(np.sort(df['ColB'].tolist(), axis=1), index=df.index)
    print (df1)
       0  1
    0  1  3
    1  1  2
    2  0  8
    3  3  7
    
    m = df1.eq(df2).all(axis=1)
    print (m)
    0     True
    1    False
    2     True
    3    False
    dtype: bool