I have a dataframe in Python like this:
```
        st      se  st_min  st_max  se_min  se_max
42  922444  923190  922434  922454  923180  923200
24  922445  923190  922435  922455  923180  923200
43  928718  929456  928708  928728  929446  929466
37  928718  929459  928708  928728  929449  929469
```
As you can see, the first two columns define a range, and the min/max columns allow a variation of 10 positions around each endpoint of that range.
I know that the function drop_duplicates can remove duplicate rows based on an exact match of values.
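For example, this drops rows whose st and se values match exactly (using `df` as above):

```python
df.drop_duplicates(subset=['st', 'se'])
```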
But what if I want to remove rows based on a range of values? For example, indexes 42 and 24 fall within the same range (if I consider a tolerance of 10), and indexes 43 and 37 are in the same situation.
How can I do this?
PS: I can't deduplicate based on only one column (e.g. st or se); I need to remove the redundancy based on both columns (st and se), using the min and max columns as filters.
I assume you want to combine all ranges, so that all ranges that overlap are reduced to one row. I think you need to do that iteratively, because there could be multiple ranges that together form one big range, not just two. You could do it like this (just replace `df` with the variable you use to store your dataframe):
```python
import pandas as pd

# create a dummy key column to produce a cartesian product
df['fake_key'] = 0
right_df = pd.DataFrame(df, copy=True)
right_df.rename({col: col + '_r' for col in right_df if col != 'fake_key'},
                axis='columns', inplace=True)

# this variable indicates that we need to perform the loop once more
change = True
# diff and new_diff are used to see if the loop iteration changed something
# it's monotonically increasing, btw.
new_diff = (right_df['se_r'] - right_df['st_r']).sum()
while change:
    diff = new_diff
    joined_df = df.merge(right_df, on='fake_key')
    # drop pairs whose ranges cannot overlap (right range starts after left range ends)
    invalid_indexer = joined_df['se'] < joined_df['st_r']
    joined_df.drop(joined_df[invalid_indexer].index, axis='index', inplace=True)
    right_df = joined_df.groupby('st').aggregate(
        {col: 'max' if '_min' not in col else 'min' for col in joined_df})
    # update the ..._min / ..._max fields in the combined range
    for col in ['st_min', 'se_min', 'st_max', 'se_max']:
        col_r = col + '_r'
        col1, col2 = (col, col_r) if 'min' in col else (col_r, col)
        right_df[col_r] = right_df[col1].where(right_df[col1] <= right_df[col2],
                                               right_df[col2])
    right_df.drop(['se', 'st_r', 'st_min', 'se_min', 'st_max', 'se_max'],
                  axis='columns', inplace=True)
    right_df.rename({'st': 'st_r'}, axis='columns', inplace=True)
    right_df['fake_key'] = 0
    # now check if we need to iterate once more
    new_diff = (right_df['se_r'] - right_df['st_r']).sum()
    change = diff <= new_diff

# now all ranges which overlap have the same value for se_r,
# so we just need to aggregate on se_r to remove them
result = right_df.groupby('se_r').aggregate(
    {col: 'min' if '_max' not in col else 'max' for col in right_df})
result.rename({col: col[:-2] if col.endswith('_r') else col for col in result},
              axis='columns', inplace=True)
result.drop('fake_key', axis='columns', inplace=True)
```
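To reproduce, you can build the sample frame from the question like this (a minimal sketch; the index values are the question's row labels), then run the snippet above and print `result`:

```python
import pandas as pd

df = pd.DataFrame(
    {'st':     [922444, 922445, 928718, 928718],
     'se':     [923190, 923190, 929456, 929459],
     'st_min': [922434, 922435, 928708, 928708],
     'st_max': [922454, 922455, 928728, 928728],
     'se_min': [923180, 923180, 929446, 929449],
     'se_max': [923200, 923200, 929466, 929469]},
    index=[42, 24, 43, 37])
```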
If you execute this on your data, you get:
```
            st      se  st_min  st_max  se_min  se_max
se_r
923190  922444  923190  922434  922455  923180  923200
929459  928718  929459  922434  928728  923180  929469
```
Note: if your data set is larger than a few thousand records, you might need to change the join logic above, which produces a cartesian product. In the first iteration you get a `joined_df` of size n², where n is the number of records in your input dataframe (e.g. the four sample rows produce a 16-row `joined_df`). In each later iteration `joined_df` gets smaller due to the aggregation.
I just ignored that, because I don't know how large your dataset is. Avoiding it would make the code a bit more complex. But if you need to, you could create an auxiliary dataframe that allows you to "bin" the `se` values of both dataframes and use the binned value as the `fake_key`. It's not quite regular binning: you would have to create a dataframe that contains, for each `fake_key`, all values in the range (0...`fake_key`). So e.g. if you define your fake key to be `fake_key = se // 1000`, your dataframe would contain:
```
fake_key  fake_key_join
     922            922
     922            921
     922            920
     ...            ...
     922              0
```
If you replace the `merge` in the loop above with code that merges such a dataframe on `fake_key` with `right_df`, and then merges the result on `fake_key_join` with `df`, you can use the rest of the code and get the same result as above, but without having to produce a full cartesian product.
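As a rough sketch of that idea: the helper name `make_bin_helper`, the bin width of 1000, and the exact merge plumbing below are my own assumptions, and the choice of which column each frame is binned on would need tuning for your data:

```python
import pandas as pd

def make_bin_helper(max_key):
    # for each fake_key value k, emit one row per join value in 0..k,
    # mirroring the table above
    pairs = [(k, j) for k in range(max_key + 1) for j in range(k + 1)]
    return pd.DataFrame(pairs, columns=['fake_key', 'fake_key_join'])

# assuming fake_key = se // 1000 on df and fake_key = se_r // 1000 on right_df,
# the cartesian merge in the loop could be replaced by something like:
helper = make_bin_helper(int(right_df['fake_key'].max()))
joined_df = (right_df.merge(helper, on='fake_key')
                     .merge(df.rename(columns={'fake_key': 'fake_key_join'}),
                            on='fake_key_join')
                     .drop(columns=['fake_key_join']))
```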