I have a large DataFrame of 140,000 rows. One of the columns holds a value that I compare against two thresholds to assign a class index; I then find the index of the closest value in a reference array, where the reference array used depends on that class.
As an example, here's a similar task: I have a time series, ts, with temperature "temp" and humidity "h".
I can do this via:
def temp_and_h(row):
    temp_i = row['temp']
    h_i = row['h']
    if temp_i > 2:
        cls = 0
    elif temp_i < -2:
        cls = 1
    else:
        cls = 2
    # find the index of the closest value in the class-specific row of h_ref
    diffs = [abs(x - h_i) for x in h_ref[cls]]
    h_idx = diffs.index(min(diffs))
    return pd.Series([cls, h_idx])
Which is applied row-wise via:
ts[['temp_idx', 'h_idx']] = ts.apply(temp_and_h, axis=1)
However, on my machine this takes ~670 seconds. I need to speed this up as much as possible. What other approaches can I try? I tried np.vectorize, but the memory usage is too large (>100 GB).
You can vectorize using pd.cut and groupby.apply:
ts['temp_idx'] = pd.cut(ts['temp'],
                        bins=[-np.inf, -2, 2, np.inf],
                        labels=[1, 2, 0])
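One edge-case note: pd.cut is right-inclusive by default, so temp == -2 lands in class 1 here, while the original elif puts it in class 2. A minimal check of the binning, with assumed dummy temperatures covering all three classes:

```python
import numpy as np
import pandas as pd

# assumed dummy values: one temperature per class
temps = pd.Series([3.5, -3.5, 0.0])
labels = pd.cut(temps,
                bins=[-np.inf, -2, 2, np.inf],
                labels=[1, 2, 0])
print(labels.tolist())  # [0, 1, 2]
```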
ts['h_idx'] = (ts.groupby('temp_idx', group_keys=False, observed=True)['h']
                 .apply(lambda g: pd.Series(np.abs(g.to_numpy()[:, None] - h_ref[g.name])
                                              .argmin(axis=1),
                                            index=g.index)
                        )
               )
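If the categorical groupby adds overhead, an equivalent NumPy-only sketch does the same thing with per-class boolean masks (dummy data below; the np.select framing and loop are my own, not the answer's code). The broadcast temporary is bounded by the size of the largest class rather than the full frame:

```python
import numpy as np
import pandas as pd

# assumed dummy data in the same shape as the question's setup
rng = np.random.default_rng(0)
N = 8
ts = pd.DataFrame({'temp': rng.uniform(-5, 5, N),
                   'h': rng.uniform(0, 1, N)})
h_ref = rng.random((5, 5))

# class assignment: >2 -> 0, <-2 -> 1, else 2
temp_idx = np.select([ts['temp'] > 2, ts['temp'] < -2], [0, 1], default=2)

h = ts['h'].to_numpy()
h_idx = np.empty(N, dtype=int)
for cls in np.unique(temp_idx):
    mask = temp_idx == cls
    # broadcast |h - h_ref[cls]| only for the rows of this class
    h_idx[mask] = np.abs(h[mask, None] - h_ref[cls]).argmin(axis=1)

ts['temp_idx'], ts['h_idx'] = temp_idx, h_idx
```

The peak intermediate is (largest class size × h_ref.shape[1]) floats, the same per-class bound as the groupby version but without the categorical bookkeeping.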
Example with dummy data:
temp h temp_idx h_idx
2024-01-01 00:00:00 0.507979 0.029876 2 0
2024-01-01 01:00:00 2.081478 0.456833 0 1
2024-01-01 02:00:00 -2.090953 0.649144 1 4
2024-01-01 03:00:00 0.108276 0.278487 2 0
2024-01-01 04:00:00 3.929470 0.676255 0 1
2024-01-01 05:00:00 3.962931 0.590863 0 1
2024-01-01 06:00:00 -3.744147 0.023982 1 4
2024-01-01 07:00:00 -2.927571 0.558854 1 4
2024-01-01 08:00:00 -4.485328 0.259252 1 4
2024-01-01 09:00:00 -0.591902 0.415101 2 0
Reproducible input:
import numpy as np
import pandas as pd
np.random.seed(3)
N = 10
ts = pd.DataFrame({'temp': np.random.uniform(-5, 5, N),
'h': np.random.uniform(0, 1, N),
}, index=pd.date_range('2024-01-01', freq='h', periods=N))
h_ref = np.random.random((5, 5))
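If each row of h_ref can be sorted up front, the nearest-value lookup drops from an O(m) broadcast per row to O(log m) per element via np.searchsorted, which keeps memory flat for 140,000 rows. A sketch under that assumption (the helper name nearest_idx_sorted is mine):

```python
import numpy as np

def nearest_idx_sorted(values, ref_row):
    """For each element of `values`, return the index of the nearest
    element of `ref_row` (indices refer to the original, unsorted row)."""
    order = np.argsort(ref_row)
    srt = ref_row[order]                # sorted copy, built once per row
    pos = np.searchsorted(srt, values)  # insertion points into the sorted row
    pos = np.clip(pos, 1, len(srt) - 1)
    left, right = srt[pos - 1], srt[pos]
    # step back one slot where the left neighbour is at least as close
    pos -= np.abs(values - left) <= np.abs(values - right)
    return order[pos]

# usage per class, e.g.: h_idx[mask] = nearest_idx_sorted(h[mask], h_ref[cls])
```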