I'm working with a pandas dataframe, trying to clean up some data, and I want to apply multiple rules to a certain column. If the column value is greater than 500, I want to drop the row. If the column value is between 101 and 500, I want to replace it with 100. If the column value is less than 101, I want to keep it as-is.
I'm able to do it in 2 lines of code, but I was curious whether there's a cleaner, more efficient way to do this. I tried an if/elif/else and a lambda function, but I couldn't get either to run.
# This drops all rows that are greater than 500
df.drop(df[df.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df['Percent'] = df['Percent'].clip(upper = 100)
You can use .loc with a boolean mask instead of .drop() with an index, together with the fast NumPy function numpy.where(), for better performance, as follows:
import numpy as np

# .copy() avoids a SettingWithCopyWarning when assigning to the filtered frame
df2 = df.loc[df['Percent'] <= 500].copy()
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
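For example, on a small hypothetical dataframe (the column values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data covering all three rules
df = pd.DataFrame({'Percent': [50, 250, 600, 101, 100]})

# Keep rows at or below 500, then cap the remaining values at 100
df2 = df.loc[df['Percent'] <= 500].copy()
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])

print(df2['Percent'].tolist())  # [50, 100, 100, 100]
```

The row with 600 is dropped, 250 and 101 are capped at 100, and 50 and 100 pass through unchanged.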
Part 1: Original size dataframe
Old code:
%%timeit
df.drop(df[df.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df['Percent'] = df['Percent'].clip(upper = 100)
1.58 ms ± 56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
New code:
%%timeit
df2 = df.loc[df['Percent'] <= 500]
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
784 µs ± 8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Benchmarking result:
The new code takes 784 µs while the old code takes 1.58 ms:
Around 2x faster
Part 2: Large size dataframe
Let's use a dataframe 10000 times the original size:
df9 = pd.concat([df] * 10000, ignore_index=True)
Old code:
%%timeit
df9.drop(df9[df9.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df9['Percent'] = df9['Percent'].clip(upper = 100)
3.87 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New code:
%%timeit
df2 = df9.loc[df9['Percent'] <= 500]
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
1.96 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Benchmarking result:
The new code takes 1.96 ms while the old code takes 3.87 ms:
Also around 2x faster
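As a side note, if you prefer to keep .clip() from your original approach, the same result can be written as a single chained expression on the filtered frame (not benchmarked here, sketched on made-up sample data):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'Percent': [50, 250, 600]})

# Filter out rows above 500, then cap the remaining values at 100
df2 = df.loc[df['Percent'] <= 500, ['Percent']].clip(upper=100)

print(df2['Percent'].tolist())  # [50, 100]
```

Because .clip() returns a new object here, there is no chained-assignment concern, though it may not be as fast as the numpy.where() version above.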