python pandas numpy vectorization

Vectorizing drift change in Pandas


Here's the Python code:

import numpy as np

t0 = df['Temperature'].iloc[0]  # DataFrame df with column 'Temperature' is already given
df['DriftedTemp'] = None
col = df.columns.get_loc('DriftedTemp')  # column position for positional writes

for i in range(1, len(df)):
    t = df['Temperature'].iloc[i]
    if np.abs(t - t0) > toffset:  # toffset is a parameter that is given
        # Write through df.iloc directly; chained df[...].iloc[i] = ... may not update df
        df.iloc[i, col] = t
        t0 = t

It finds the rows where the temperature has drifted from the previously recorded value by more than "toffset", writes the new value into the "DriftedTemp" column at that row, and updates "t0" to the "Temperature" at the point where the drift happened.
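To make the expected output concrete, here is a minimal sketch of the same rule on a plain Python list. The function name `drifted_values` and the sample temperatures are made up for illustration; they are not part of the original code.

```python
import math

def drifted_values(temperatures, toffset):
    """Return the new temperature at each drift row and NaN everywhere else."""
    result = [math.nan] * len(temperatures)
    if not temperatures:
        return result
    t0 = temperatures[0]  # reference value; the first row is never marked
    for i in range(1, len(temperatures)):
        if abs(temperatures[i] - t0) > toffset:
            result[i] = temperatures[i]
            t0 = temperatures[i]  # later rows are compared against this value
    return result

temps = [20.0, 20.5, 23.0, 23.5, 26.5, 25.0]  # hypothetical sample data
print(drifted_values(temps, 2))
# [nan, nan, 23.0, nan, 26.5, nan]
```

Row 2 is marked because 23.0 differs from 20.0 by more than 2. Row 4 is marked because 26.5 differs from the updated reference 23.0 by more than 2.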

The issue with such code is that the current value depends on a previous value that was itself evaluated in an earlier row. Vectorization treats each column as a whole vector, so the changed state of previous rows is not reflected by simple vectorization.
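A small illustration of the problem, using made-up data with a slow, steady drift: the two obvious single-pass comparisons (against the previous row, and against the first row) both disagree with the loop, because neither tracks the moving reference "t0".

```python
import numpy as np

temps = np.array([20.0, 21.0, 22.0, 23.0, 24.0])  # hypothetical sample data
toffset = 2

# Attempt 1: compare each row with the row before it.
# Every step is 1.0, so no row is flagged.
step_mask = np.abs(np.diff(temps, prepend=temps[0])) > toffset

# Attempt 2: compare each row with the first row.
# Rows 3 and 4 are both flagged, because the reference is never updated.
first_mask = np.abs(temps - temps[0]) > toffset

print(step_mask.tolist())   # [False, False, False, False, False]
print(first_mask.tolist())  # [False, False, False, True, True]
# The loop marks only row 3: t0 becomes 23.0 there, so 24.0 stays within toffset.
```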

This can be implemented using a while loop and vectorization, but I cannot think of a simple vectorization technique without any loops to accomplish the same task.
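For reference, the while-loop variant might look like the sketch below. This is one possible reading of "a while loop and vectorization", not code from the original post, and the name `drift_by_jumps` is hypothetical. Each pass does one vectorized scan to find the next drift row, so the Python loop runs once per drift event, not once per row.

```python
import numpy as np

def drift_by_jumps(temperatures, toffset):
    temps = np.asarray(temperatures, dtype=float)
    drifted = np.full(temps.shape, np.nan)  # NaN where no drift was detected
    start = 0  # index of the current reference value t0
    while start < len(temps) - 1:
        # Vectorized scan of all rows after the current reference.
        exceeded = np.abs(temps[start + 1:] - temps[start]) > toffset
        if not exceeded.any():
            break
        start += 1 + int(np.argmax(exceeded))  # first row that drifted
        drifted[start] = temps[start]
    return drifted

print(drift_by_jumps([20.0, 20.5, 23.0, 23.5, 26.5, 25.0], 2).tolist())
# [nan, nan, 23.0, nan, 26.5, nan]
```

Every pass rescans the rest of the array, so this probably only pays off when drift events are rare compared with the number of rows.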


Solution

  • Vectorization might not be possible, since the computation of drift depends on the previous state of drift. Having said that, this is a good use case for numba: create a function with the logic and compile it with numba to achieve C-like speeds.

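    import numba
    import numpy as np

    @numba.njit
    def drift(temperatures, toffset):
        # Float array of NaN; only rows where a drift is detected get a value
        drifted = np.full(temperatures.shape, np.nan)

        for i, t in enumerate(temperatures):
            if i == 0:
                t0 = t  # the first reading is the initial reference
            elif abs(t - t0) > toffset:
                # Record the drifted value and make it the new reference
                t0 = drifted[i] = t

        return drifted


    # toffset is 2 in this example
    df['DriftedTemp'] = drift(df['Temperature'].to_numpy(), 2)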