
Python - Remove row if item is above certain value and replace if between other values


I'm working in a pandas dataframe trying to clean up some data, and I want to apply multiple rules to a certain column. If the column value is greater than 500 I want to drop the row. If the column value is between 101 and 500 I want to replace the value with 100. If the column value is less than 101 I want to keep it as-is.

As you can see from the image, I want to drop the Item B row and adjust the Item C row.
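Since the image isn't reproduced here, a minimal stand-in dataframe with hypothetical values matching the description might look like this:

```python
import pandas as pd

# Hypothetical data matching the description: Item B is above 500 (drop it),
# Item C is between 101 and 500 (cap it at 100), Item A is below 101 (keep it).
df = pd.DataFrame({"Item": ["A", "B", "C"], "Percent": [50, 600, 250]})
print(df)
```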

I'm able to do it in 2 lines of code, but I was curious whether there's a cleaner, more efficient way to do this. I tried an if/elif/else and also a lambda function, but I couldn't get either to run.

# This drops all rows that are greater than 500
df.drop(df[df.Percent > 500].index, inplace = True)

# This sets the upper limit on all values at 100
df['Percent'] = df['Percent'].clip(upper = 100)

Solution

  • You can use .loc with a boolean mask instead of .drop() with an index, and use the fast NumPy function numpy.where(), for better performance, as follows:

    import numpy as np
    
    df2 = df.loc[df['Percent'] <= 500].copy()   # .copy() avoids SettingWithCopyWarning
    df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
    

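As a sketch under the same assumptions (the values below are hypothetical stand-ins for the original data), the second step can also be done with .clip() on the filtered frame, which reads closer to the original code:

```python
import pandas as pd

# Hypothetical stand-in for the original dataframe
df = pd.DataFrame({"Item": ["A", "B", "C"], "Percent": [50, 600, 250]})

# Filter out rows above 500, then cap the remainder at 100.
# .copy() makes df2 an independent frame, avoiding SettingWithCopyWarning.
df2 = df.loc[df["Percent"] <= 500].copy()
df2["Percent"] = df2["Percent"].clip(upper=100)
print(df2)
```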
    Performance Comparison:

    Part 1: Original size dataframe

    Old Codes:

    %%timeit
    df.drop(df[df.Percent > 500].index, inplace = True)
    
    # This sets the upper limit on all values at 100
    df['Percent'] = df['Percent'].clip(upper = 100)
    
    1.58 ms ± 56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    New Codes:

    %%timeit
    df2 = df.loc[df['Percent'] <= 500]
    df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
    
    784 µs ± 8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    Benchmarking result:

    The new codes take 784 µs while the old codes take 1.58 ms:

    Around 2x faster

    Part 2: Large size dataframe

    Let's use a dataframe 10000 times the original size:

    df9 = pd.concat([df] * 10000, ignore_index=True)
    

    Old Codes:

    %%timeit
    df9.drop(df9[df9.Percent > 500].index, inplace = True)
    
    # This sets the upper limit on all values at 100
    df9['Percent'] = df9['Percent'].clip(upper = 100)
    
    3.87 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    New Codes:

    %%timeit
    df2 = df9.loc[df9['Percent'] <= 500]
    df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
    
    1.96 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Benchmarking result:

    The new codes take 1.96 ms while the old codes take 3.87 ms:

    Also around 2x faster
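The filter-then-cap steps above can also be chained into a single expression; .assign() returns a new dataframe, so there is no chained-assignment warning (the data here is a hypothetical stand-in for the original):

```python
import pandas as pd

df = pd.DataFrame({"Item": ["A", "B", "C"], "Percent": [50, 600, 250]})  # hypothetical data

# Filter and cap in one chained expression; assign() returns a new dataframe.
df2 = df.loc[df["Percent"] <= 500].assign(Percent=lambda d: d["Percent"].clip(upper=100))
print(df2)
```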