I have a large DataFrame of 140,000 rows. One of the columns holds a value that I compare against two thresholds to assign a class index; I then find the index of the closest value in a reference array, where the reference array used depends on that class.
As an example, here's a similar task: I have a time series, ts, with temperature "temp" and humidity "h".
I can do this via:
def temp_and_h(row):
    temp_i = row['temp']
    h_i = row['h']
    if temp_i > 2:
        cls = 0
    elif temp_i < -2:
        cls = 1
    else:
        cls = 2
    # find the index of the closest value in the class-specific row of h_ref
    diffs = [abs(x - h_i) for x in h_ref[cls]]
    h_idx = diffs.index(min(diffs))
    return pd.Series([cls, h_idx])
Which is applied row-wise via:
ts[['temp_idx', 'h_idx']] = ts.apply(temp_and_h, axis=1)
However, on my machine this takes ~670 seconds. I need to speed this up as much as possible. What other approaches can I try? I tried np.vectorize, but the memory usage is too large (>100 GB).
You can vectorize using pd.cut and groupby.apply:
ts['temp_idx'] = pd.cut(ts['temp'],
                        bins=[-np.inf, -2, 2, np.inf],
                        labels=[1, 2, 0])
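One edge-case note: pd.cut is right-inclusive by default, so temp == -2 lands in class 1 here, while the original elif puts it in class 2. A minimal check of the binning, with assumed dummy temperatures covering all three classes:

```python
import numpy as np
import pandas as pd

# assumed dummy values: one temperature per class
temps = pd.Series([3.5, -3.5, 0.0])
labels = pd.cut(temps,
                bins=[-np.inf, -2, 2, np.inf],
                labels=[1, 2, 0])
print(labels.tolist())  # [0, 1, 2]
```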
ts['h_idx'] = (ts.groupby('temp_idx', group_keys=False, observed=True)['h']
                 .apply(lambda g: pd.Series(np.abs(g.to_numpy()[:, None] - h_ref[g.name])
                                              .argmin(axis=1),
                                            index=g.index)
                        )
               )
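If the categorical groupby adds overhead, an equivalent NumPy-only sketch does the same thing with per-class boolean masks (dummy data below; the np.select framing and loop are my own, not the answer's code). The broadcast temporary is bounded by the size of the largest class rather than the full frame:

```python
import numpy as np
import pandas as pd

# assumed dummy data in the same shape as the question's setup
rng = np.random.default_rng(0)
N = 8
ts = pd.DataFrame({'temp': rng.uniform(-5, 5, N),
                   'h': rng.uniform(0, 1, N)})
h_ref = rng.random((5, 5))

# class assignment: >2 -> 0, <-2 -> 1, else 2
temp_idx = np.select([ts['temp'] > 2, ts['temp'] < -2], [0, 1], default=2)

h = ts['h'].to_numpy()
h_idx = np.empty(N, dtype=int)
for cls in np.unique(temp_idx):
    mask = temp_idx == cls
    # broadcast |h - h_ref[cls]| only for the rows of this class
    h_idx[mask] = np.abs(h[mask, None] - h_ref[cls]).argmin(axis=1)

ts['temp_idx'], ts['h_idx'] = temp_idx, h_idx
```

The peak intermediate is (largest class size × h_ref.shape[1]) floats, the same per-class bound as the groupby version but without the categorical bookkeeping.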
Example with dummy data:
temp h temp_idx h_idx
2024-01-01 00:00:00 0.507979 0.029876 2 0
2024-01-01 01:00:00 2.081478 0.456833 0 1
2024-01-01 02:00:00 -2.090953 0.649144 1 4
2024-01-01 03:00:00 0.108276 0.278487 2 0
2024-01-01 04:00:00 3.929470 0.676255 0 1
2024-01-01 05:00:00 3.962931 0.590863 0 1
2024-01-01 06:00:00 -3.744147 0.023982 1 4
2024-01-01 07:00:00 -2.927571 0.558854 1 4
2024-01-01 08:00:00 -4.485328 0.259252 1 4
2024-01-01 09:00:00 -0.591902 0.415101 2 0
Reproducible input:
import numpy as np
import pandas as pd
np.random.seed(3)
N = 10
ts = pd.DataFrame({'temp': np.random.uniform(-5, 5, N),
'h': np.random.uniform(0, 1, N),
}, index=pd.date_range('2024-01-01', freq='h', periods=N))
h_ref = np.random.random((5, 5))
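If each row of h_ref can be sorted up front, the nearest-value lookup drops from an O(m) broadcast per row to O(log m) per element via np.searchsorted, which keeps memory flat for 140,000 rows. A sketch under that assumption (the helper name nearest_idx_sorted is mine):

```python
import numpy as np

def nearest_idx_sorted(values, ref_row):
    """For each element of `values`, return the index of the nearest
    element of `ref_row` (indices refer to the original, unsorted row)."""
    order = np.argsort(ref_row)
    srt = ref_row[order]                # sorted copy, built once per row
    pos = np.searchsorted(srt, values)  # insertion points into the sorted row
    pos = np.clip(pos, 1, len(srt) - 1)
    left, right = srt[pos - 1], srt[pos]
    # step back one slot where the left neighbour is at least as close
    pos -= np.abs(values - left) <= np.abs(values - right)
    return order[pos]

# usage per class, e.g.: h_idx[mask] = nearest_idx_sorted(h[mask], h_ref[cls])
```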