I'm working with a pandas dataframe, trying to clean up some data, and I want to apply multiple rules to a certain column. If the column value is greater than 500, I want to drop the row. If the column value is between 101 and 500, I want to replace it with 100. If the column value is less than 101, I want to keep it as-is.
I'm able to do it in 2 lines of code, but I was curious whether there's a cleaner, more efficient way to do this. I tried an if/elif/else and a lambda function, but I couldn't get either to run.
# This drops all rows that are greater than 500
df.drop(df[df.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df['Percent'] = df['Percent'].clip(upper = 100)
You can use .loc with a boolean mask instead of .drop() with an index, together with the fast NumPy function numpy.where(), for better performance, as follows:
import numpy as np

# .copy() avoids a SettingWithCopyWarning when assigning to the filtered frame
df2 = df.loc[df['Percent'] <= 500].copy()
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
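For example, on a small hypothetical dataframe (the column values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data covering all three rules
df = pd.DataFrame({'Percent': [50, 250, 600, 101, 100]})

# Keep rows at or below 500, then cap the remaining values at 100
df2 = df.loc[df['Percent'] <= 500].copy()
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])

print(df2['Percent'].tolist())  # [50, 100, 100, 100]
```

The row with 600 is dropped, 250 and 101 are capped at 100, and 50 and 100 pass through unchanged.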
Part 1: Original size dataframe
Old code:
%%timeit
df.drop(df[df.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df['Percent'] = df['Percent'].clip(upper = 100)
1.58 ms ± 56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
New code:
%%timeit
df2 = df.loc[df['Percent'] <= 500]
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
784 µs ± 8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Benchmarking result:
The new code takes 784 µs while the old code takes 1.58 ms:
Around 2x faster
Part 2: Large size dataframe
Let's use a dataframe 10000 times the original size:
df9 = pd.concat([df] * 10000, ignore_index=True)
Old code:
%%timeit
df9.drop(df9[df9.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df9['Percent'] = df9['Percent'].clip(upper = 100)
3.87 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New code:
%%timeit
df2 = df9.loc[df9['Percent'] <= 500]
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
1.96 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Benchmarking result:
The new code takes 1.96 ms while the old code takes 3.87 ms:
Also around 2x faster
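As a side note, if you prefer to keep .clip() from your original approach, the same result can be written as a single chained expression on the filtered frame (not benchmarked here, sketched on made-up sample data):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'Percent': [50, 250, 600]})

# Filter out rows above 500, then cap the remaining values at 100
df2 = df.loc[df['Percent'] <= 500, ['Percent']].clip(upper=100)

print(df2['Percent'].tolist())  # [50, 100]
```

Because .clip() returns a new object here, there is no chained-assignment concern, though it may not be as fast as the numpy.where() version above.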