I am trying to get rid of the rows that contain element combinations of the first two columns in a pandas dataframe, for instance, in the next df:
event1 event2 uccs ulags
0 327423.0 329243.0 0.1663 -0.6013
1 327423.0 329589.0 0.1911 -0.4730
2 329243.0 327423.0 0.1663 0.6013
3 329243.0 329589.0 0.3101 -0.7502
4 329589.0 327423.0 0.1911 0.4730
5 329589.0 329243.0 0.3101 0.7502
rows 0 and 2 present a combination of elements: event1 and event2. That is:
0 327423.0 329243.0
2 329243.0 327423.0
In general, I need to reduce the matrix or df to 3 rows, removing all duplicates.
I have tried the next without success:
u = df.filter(like='event1').values
m = pd.DataFrame(np.sort(u, axis=1)).duplicated()
doing that I get:
event1 event2 uccs ulags
0 327423.0 329243.0 0.1663 -0.6013
2 329243.0 327423.0 0.1663 0.6013
4 329589.0 327423.0 0.1911 0.4730
However as you can see, rows 0 and 2 are duplicated. Any tips about how to do this in pandas or numpy will be appreciated.
If I understand you correctly, you were quite close, you just have to pass your boolean array back to your dataframe for boolean indexing
:
df[~pd.DataFrame(np.sort(df.filter(like='event'), axis=1)).duplicated()]
event1 event2 uccs ulags
0 327423.0 329243.0 0.1663 -0.6013
1 327423.0 329589.0 0.1911 -0.4730
3 329243.0 329589.0 0.3101 -0.7502
Notice the ~
I used which is inverse or the not operator:
m = pd.DataFrame(np.sort(df.filter(like='event'), axis=1)).duplicated()
print(m)
0 False
1 False
2 True
3 False
4 True
5 True
dtype: bool
Now inverse:
print(~m)
0 True
1 True
2 False
3 True
4 False
5 False
dtype: bool