Search code examples
pandasmappingapply

Efficient way to assign a value to a dataframe row, based on comparing value in another column with thresholds?


I have a large dataframe of 140,000 rows. In one of the columns is a value, which I need to compare with two thresholds, and then assign an index. Then, I also find the closest corresponding value in another index, but that index is specific to the prior task.

As an example, here's a similar task: I have a timeseries, ts, with temperature "temp" and humidity "h".

  1. Based on two temperature thresholds, select a "class" of 0, 1 or 2.
  2. There exist then three humidity arrays, one for each class. Find the index of the closest value to the humidity value in the specific row of ts. Note; in the example, I didn't pass the humidity arrays (h_ref) to the function, they are available globally.

I can do this via:

def temp_and_h(ts):
    temp_i = ts['temp']
    h_i = ts['h']
    
    if temp_i > 2:
        class=0
    elif temp_i < -2:
        class=1
    else:
        class=2

    # find the index of closest value in the correct (class-specific) array in "h_ref"
    diffs = [abs(x - h_i) for x in h_ref[class]]
    h_idx = diffs.index(min(diffs))

    return (class,h_idx)

Which is simply called via:

ts[['temp_idx', 'h_idx']] = ts.apply(stability_and_TI, axis=1)

However, on my machine this takes around ~670 seconds. I need to speed this up as much as possible. What other approaches can I try? I tried using np.vectorize, but the memory usage is too large (>100GB).


Solution

  • You can vectorize using pd.cut and groupby.apply:

    ts['temp_idx'] = pd.cut(ts['temp'],
                            bins=[-np.inf, -2, 2, np.inf],
                            labels=[1, 2, 0])
    
    ts['h_idx'] = (ts.groupby('temp_idx', group_keys=False)['h']
                     .apply(lambda g: pd.Series((g.to_numpy()[:,None]-h_ref[g.name]
                                                 ).argmin(axis=1),
                                                index=g.index
                                                )
                           )
                  )
    

    Example with dummy data:

                             temp         h temp_idx  h_idx
    2024-01-01 00:00:00  0.507979  0.029876        2      0
    2024-01-01 01:00:00  2.081478  0.456833        0      1
    2024-01-01 02:00:00 -2.090953  0.649144        1      4
    2024-01-01 03:00:00  0.108276  0.278487        2      0
    2024-01-01 04:00:00  3.929470  0.676255        0      1
    2024-01-01 05:00:00  3.962931  0.590863        0      1
    2024-01-01 06:00:00 -3.744147  0.023982        1      4
    2024-01-01 07:00:00 -2.927571  0.558854        1      4
    2024-01-01 08:00:00 -4.485328  0.259252        1      4
    2024-01-01 09:00:00 -0.591902  0.415101        2      0
    

    Reproducible input:

    import numpy as np
    
    np.random.seed(3)
    N = 10
    ts = pd.DataFrame({'temp': np.random.uniform(-5, 5, N),
                       'h': np.random.uniform(0, 1, N),
                      }, index=pd.date_range('2024-01-01', freq='h', periods=N))
    
    h_ref = np.random.random((5, 5))