pandas, dataframe, weighting

Iteratively remove unsorted rows on a condition until a defined dataframe size is reached


How can I reduce the size of a dataframe one row at a time, based on a condition (partnersCount = 1, removing in selectionWeighting order from lowest to highest), until the dataframe reaches a specified size?

Pseudocode:

  • while the size of the df is bigger than we expect // check whether there are any single partnerIDs (partnersCount = 1) // if so, remove those partnersCount = 1 rows, lowest selectionWeighting first

Ideally, I'd like to continue the above, so that:

  • if the df is still too large // find partnerID groups with multiple rows and the lowest selectionWeighting, and remove those groups/pairs iteratively until the desired size is reached. The difficulty here is that rows sharing the same partnerID might have different selectionWeightings. (A rough sketch of this second phase follows below.)
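
Roughly, I picture the two phases like this (the helper name shrink_to_target and the use of the group's mean weighting as the removal criterion are only illustrative, not part of my real code):

import pandas as pd

def shrink_to_target(df: pd.DataFrame, target_size: int) -> pd.DataFrame:
    # phase 1: drop single-partner rows (partnersCount == 1),
    # lowest selectionWeighting first, until the target size is reached
    while len(df) > target_size:
        singles = df[df['partnersCount'] == 1]
        if singles.empty:
            break
        df = df.drop(singles['selectionWeighting'].idxmin())

    # phase 2: drop whole partnerID groups, lowest mean weighting first;
    # rows sharing a partnerID can carry different weightings, so the mean
    # is just one possible way to rank the groups (min or sum would also work)
    # note: dropping a whole group can overshoot the target size
    while len(df) > target_size:
        group_weight = df.groupby('partnerID')['selectionWeighting'].mean()
        df = df[df['partnerID'] != group_weight.idxmin()]

    return df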

My current approach works, but I struggle to believe it is the best way to achieve this. Data/df sizes are always < 2k, so perhaps someone can recommend an alternative approach. Note: for reasons I won't go into, simply sorting and removing from the bottom/top up is not an option.

import pandas as pd

# some pretend data
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
                   'selectionWeighting': [1, 1, .45, .45, .3, .3, .2, .2, .2, .2, .1, .1, 0, 0, 0, 1, 1, .45, .3, .3],
                   'partnerID': [1, 1, 4, 4, 3, 9, 2, 2, 11, 2, 1, 1, 0, 7, 0, 1, 1, 4, 3, 1],
                   'partnersCount': [6, 6, 3, 3, 2, 1, 3, 3, 1, 3, 6, 6, 2, 1, 2, 6, 6, 3, 2, 1]})
df.reset_index(drop=True, inplace=True)
df = df.sort_values(['selectionWeighting'], ascending=[False])

# init 
# in reality, these are variable and coming from elsewhere. 
targetDfSize = 18 # target size changes every time.
currentDfSize = df.shape[0]
difference = max(0,currentDfSize - targetDfSize)

if difference:
    for i in range(difference):
        canRemove = df[(df['partnersCount']==1) & (df['selectionWeighting']!=1)] # get those prioritised for removal
        #display(canRemove)
        df = df[df['partnersCount']>1] # clean up before we put the removable rows back
        #display(df)
        if canRemove.shape[0]>0: 
            # so there are some single partners we can remove
            
            display(df)  # display() is only available in IPython/Jupyter
            canRemove = canRemove.iloc[:-1] # drop 1: remove the last (lowest-weighted) row

            # note: DataFrame.append was removed in pandas 2.0; pd.concat([df, canRemove]) is the modern equivalent
            df = df.append(canRemove, ignore_index=True) # append the remainder back, then go check if we're still too big
            print(df.shape[0])
df

Solution

  • I think there are a lot of expensive, and possibly unnecessary, operations in your example.

    If I understand your example correctly, you first remove all of the canRemove rows with df = df[df['partnersCount']>1] and then append all but one of them back with df = df.append(canRemove, ignore_index=True). Why not remove only one row at a time?
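
    For instance, one iteration of your loop could drop the single lowest-weighted candidate directly (an illustrative rewrite of the loop body only, reusing your variable names):

    # df is sorted by selectionWeighting descending, so the last candidate row is the lowest weighted
    worst = canRemove.index[-1]
    df = df.drop(worst)  # drop just that one row, then re-check the size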

    Secondly, if you know how many rows to remove, why loop at all?

    I suggest this implementation:

    ## select all rows such that partnersCount is 1 and selectionWeighting is not 1
    can_remove = df.query("partnersCount == 1 and selectionWeighting != 1")
    
    ## select only the n (= difference) last rows you can remove;
    ## the frame is sorted by selectionWeighting descending, so these are the lowest-weighted candidates
    ## (this assumes difference > 0: iloc[-0:, :] would select every row)
    to_remove = can_remove.iloc[-difference :, :]
    
    ## construct a mask that is False where a cell belongs to a row in to_remove
    df_mask = df.ne(to_remove)
    
    ## new dataframe using the mask: masked cells become NaN and dropna removes those rows
    df = df[df_mask].dropna()
    

    For heavy data, I strongly advise you to use query to speed up your code, and to avoid for loops with Pandas. There are a lot of built-in optimizations in Pandas which you can exploit.
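
    As a side note, since to_remove keeps the original index labels, the same removal can also be written with DataFrame.drop instead of the mask (an equivalent alternative, not part of the snippet above):

    if difference:  # guard: iloc[-0:] would otherwise select every row
        df = df.drop(can_remove.iloc[-difference:].index)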

    I hope I answered your question.

    Edit: updated to better answer the OP's question and to remove a redundant query.