Search code examples
pythonpandasdataframefilteringmasking

The masking (filtering) of pandas dataframe is too slow


I have a dataframe that has around 19000 rows with 3 columns (X, Y, Z), and I am trying to mask the dataframe so that I have data with X_max>X>=X_min, Y_max>Y>Y_min, and Z_max>Z>Z_min.

In this example,

df['X'] is 0.0, 0.1, 0.2, 0.3, ..., 5.0
df['Y'] is -3.0, -2.9, -2.8, ..., 3.0
df['Z'] is -2.0, -1.9, ..., -1.5

so, the number of row is 51 * 61 * 6 = 18666

When I create a condition for masking, it takes about 1 second.

cond1 = df['X']>=X_min

I have 6 conditions as below, and creating 6 conditions take about 3-3.5 seconds.

start1 = time()
cond1 = df['X']>=X_min
cond2 = df['X']>=X_max
cond3 = df['X']>=Y_min
cond4 = df['X']>=Y_max
cond5 = df['X']>=Z_min
cond6 = df['X']>=Z_max
finish1 = time()
print(finish1 - start1)  # this is about 3-3.5 sec

start2 = time()
df2= df[conjunction(cond1&cond2&cond3&cond4&cond5&cond6)] does not take long.
finish2 = time()
print(finish2 - start2)  # this is about 0.002 sec

By the way, the code below took similar time (3-3.5 sec).

df2 = df[(df['X']>=X_min)&(df['X']<X_max)&(df['Y']>=Y_min)&(df['Y']<Y_max)&(df['Z']>=Z_min)&(df['Z']<Z_max)]

How could I possibly boost the speed up? Can I make it much faster by keeping pandas dataframe?


Solution

  • You may want to run df.info() to double-check the data types of columns. The comparison should be much faster on numeric values. It would be much slower if the columns are strings.