I have a huge dataset held in a csv file which holds 2 columns in excel of data which I need to print out the row ID of which ones have the same results (regardless of the order of numbers) and another list of results that are not the same. Some cells have more than 2 integers in them. I am trying to use Python to do this, I've also read around and Panda seems to be the way forwards. Please note I am new to Python and cannot seem to find a simple script to do this (only ones that delete duplicates).
So for example the required output would give me
Duplicated Row ID's: 1, 3
.
Unique Row ID's: 2, 4
.
A B C
ID Col1 Col2
-----------------------------
1 ["1", "3"] ["3", "1"]
2 ["1", "2"] ["9", "2"]
3 ["8", "0"] ["0", "8"]
4 ["3", "7"] ["1", "3"]
First if necessary convert string repr to lists:
import ast
df['ColA'] = df['ColA'].apply(ast.literal_eval)
df['ColB'] = df['ColB'].apply(ast.literal_eval)
Then sorting both columns lists and compare by Series.eq
, last filter by DataFrame.loc
with boolean indexing
, ~
is for invert mask:
m = df['ColA'].apply(sorted).eq(df['ColB'].apply(sorted))
dup = df.loc[m, 'ID']
print (dup)
0 1
2 3
Name: ID, dtype: int64
un = df.loc[~m, 'ID']
print (un)
1 2
3 4
Name: ID, dtype: int64
For improve performance is possible convert lists to DataFrames and compare, only necessary same lengths of lists in both columns and in all rows:
df1 = pd.DataFrame(np.sort(df['ColA'].tolist(), axis=1), index=df.index)
df2 = pd.DataFrame(np.sort(df['ColB'].tolist(), axis=1), index=df.index)
print (df1)
0 1
0 1 3
1 1 2
2 0 8
3 3 7
m = df1.eq(df2).all(axis=1)
print (m)
0 True
1 False
2 True
3 False
dtype: bool