python · pandas · redundancy

drop_duplicates in a range


I have a dataframe in Python like this:

    st      se      st_min  st_max  se_min  se_max 
42  922444  923190  922434  922454  923180  923200
24  922445  923190  922435  922455  923180  923200
43  928718  929456  928708  928728  929446  929466
37  928718  929459  928708  928728  929449  929469

As we can see, the first two columns define a range, and the min/max columns allow a variation of 10 positions around the start and end of that range.

I know that the function drop_duplicates can remove duplicate rows based on an exact match of values.

But what if I want to remove rows based on a range of values? For example, indexes 42 and 24 fall in the same range (if I consider a tolerance of 10), and indexes 43 and 37 are in the same situation.

How can I do this?

PS: I can't remove rows based on only one column (e.g. st or se); I need to remove redundancy based on both columns (st and se), using the min and max columns as the allowed ranges...
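
For reference, the frame above can be rebuilt like this (just a sketch to make the example reproducible; the labels in the first column are kept as the index):

    import pandas as pd

    # rebuild the example dataframe from the table above
    df = pd.DataFrame(
        {'st':     [922444, 922445, 928718, 928718],
         'se':     [923190, 923190, 929456, 929459],
         'st_min': [922434, 922435, 928708, 928708],
         'st_max': [922454, 922455, 928728, 928728],
         'se_min': [923180, 923180, 929446, 929449],
         'se_max': [923200, 923200, 929466, 929469]},
        index=[42, 24, 43, 37])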


Solution

  • I assume you want to combine all ranges, so that all overlapping ranges are reduced to one row. I think you need to do that repeatedly, because several ranges could chain together into one big range, not just two. You could do it like this (just replace df with the variable you use to store your dataframe):

    import pandas as pd

    # create a dummy key column to produce a cartesian product
    df['fake_key'] = 0
    right_df = pd.DataFrame(df, copy=True)
    right_df.rename({col: col + '_r' for col in right_df if col != 'fake_key'}, axis='columns', inplace=True)

    # this variable indicates that we need to perform the loop once more
    change = True
    # diff and new_diff are used to see if the loop iteration changed something
    # (it's monotonically increasing, btw.)
    new_diff = (right_df['se_r'] - right_df['st_r']).sum()
    while change:
        diff = new_diff
        # cartesian product of the original rows with the current ranges
        joined_df = df.merge(right_df, on='fake_key')
        # drop pairs whose ranges cannot be combined (the left range ends before the right range starts)
        invalid_indexer = joined_df['se'] < joined_df['st_r']
        joined_df.drop(joined_df[invalid_indexer].index, axis='index', inplace=True)
        right_df = joined_df.groupby('st').aggregate({col: 'max' if '_min' not in col else 'min' for col in joined_df})
        # update the ..._min / ..._max fields in the combined range
        for col in ['st_min', 'se_min', 'st_max', 'se_max']:
            col_r = col + '_r'
            col1, col2 = (col, col_r) if 'min' in col else (col_r, col)
            right_df[col_r] = right_df[col1].where(right_df[col1] <= right_df[col2], right_df[col2])
        right_df.drop(['se', 'st_r', 'st_min', 'se_min', 'st_max', 'se_max'], axis='columns', inplace=True)
        right_df.rename({'st': 'st_r'}, axis='columns', inplace=True)
        right_df['fake_key'] = 0
        # now check if we need to iterate once more
        new_diff = (right_df['se_r'] - right_df['st_r']).sum()
        change = diff <= new_diff

    # now all ranges which overlap have the same value for se_r
    # so we just need to aggregate on se_r to remove them
    result = right_df.groupby('se_r').aggregate({col: 'min' if '_max' not in col else 'max' for col in right_df})
    result.rename({col: col[:-2] if col.endswith('_r') else col for col in result}, axis='columns', inplace=True)
    result.drop('fake_key', axis='columns', inplace=True)
    

    If you execute this on your data, you get:

                st      se  st_min  st_max  se_min  se_max
    se_r                                                  
    923190  922444  923190  922434  922455  923180  923200
    929459  928718  929459  922434  928728  923180  929469
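
    Since the final groupby leaves se_r as the index (the se column repeats the same values), you can get back a flat frame with a plain reset_index if you prefer:

        result = result.reset_index(drop=True)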
    

    Note: if your data set is larger than a few thousand records, you might need to change the join logic above, which produces a cartesian product. In the first iteration you get a joined_df of size n², where n is the number of records in your input dataframe; in each later iteration joined_df gets smaller because of the aggregation.

    I just ignored that, because I don't know how large your dataset is; avoiding it would make the code a bit more complex. But if you need it, you could create an auxiliary dataframe that lets you "bin" the se values on both dataframes and use the binned value as the fake_key. It's not quite regular binning: you would have to create a dataframe that contains, for each fake_key, all values in the range (0...fake_key). So e.g. if you define your fake key to be fake_key=se//1000, your dataframe would contain

    fake_key  fake_key_join
    922       922
    922       921
    922       920
    ...       ...
    922       0
    

    If you replace the merge in the loop above with code that merges such a dataframe on fake_key with right_df, and then merges the result on fake_key_join with df, you can keep the rest of the code and get the same result as above, but without producing a full cartesian product.
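
    A rough, untested sketch of what that could look like (the helper names bin_size, all_keys and keys_df are just illustrative, and the bin size of 1000 follows the example above):

        import pandas as pd

        # bin the se values on both dataframes, as described above
        bin_size = 1000
        df['fake_key'] = df['se'] // bin_size
        right_df['fake_key'] = right_df['se_r'] // bin_size

        # auxiliary dataframe: for every fake_key, list all keys from 0 up to fake_key
        all_keys = pd.concat([df['fake_key'], right_df['fake_key']]).unique()
        keys_df = pd.DataFrame(
            [(k, j) for k in all_keys for j in range(k + 1)],
            columns=['fake_key', 'fake_key_join'])

        # replacement for `joined_df = df.merge(right_df, on='fake_key')` inside the loop:
        # merge the auxiliary dataframe on fake_key with right_df, then merge the result
        # on fake_key_join with df
        joined_df = (keys_df.merge(right_df, on='fake_key')
                     .drop(columns='fake_key')
                     .merge(df, left_on='fake_key_join', right_on='fake_key')
                     .drop(columns='fake_key_join'))

        # note: with this variant, right_df['fake_key'] would have to be recomputed
        # from se_r at the end of each loop iteration instead of being set to 0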