Tags: python, pandas, numpy, iteration, mathematical-optimization

How to efficiently standardize data in relation to preceding entries


I am trying to write a Python script that standardizes a large dataset (more than 10,000 entries) to the range -1 to 1, with each entry scaled relative to the entries preceding it.

What I'm currently doing is iterating through the dataset and, at each index, slicing the array from 9 indices before the current index through the current index (inclusive), which gives a window of 10 entries. I then scale the current entry relative to the minimum and maximum of that window. This works, but it takes a really long time on my machine.

I wonder if there's any way to optimize the process. I am OK with using external libraries such as pandas and NumPy.

I am well aware of the performance differences between Python lists, pandas, and NumPy, but I am more interested in general optimization techniques, if any exist, that apply in most scenarios. I am looking for mathematical and generic algorithmic optimizations rather than raw computational speedups.

Here's a simplified representation of my code:

def standardize_last(data):
    # Find the window's minimum and maximum in a single pass
    # (named lo/hi to avoid shadowing the built-in min and max)
    lo = data[0]
    hi = data[0]
    for entry in data:
        if entry < lo:
            lo = entry
        elif entry > hi:
            hi = entry
    if lo == hi:
        # Constant window: the range is zero, so return 0
        return 0
    # 1. Shift range minimum to 0
    # 2. Divide value by range to get a standard value from 0 to 1
    # 3. Move to range from -1 to 1
    return (((data[-1] - lo) / (hi - lo)) * 2) - 1

results = []
dataset = [2,2,2,2,2,2,2,2,2,-2,-1,0,1,2,3,2]

# Standardize each entry against a window of itself and the 9 entries before it
for i in range(9, len(dataset)):
    window = dataset[i - 9:i + 1]
    val = standardize_last(window)
    results.append(val)

print(results)

Output (results start at dataset index 9):

[-1.0, -0.5, 0.0, 0.5, 1.0, 1.0, 0.6000000000000001]


Solution

  • Try using vectorized operations:

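    from numpy.lib.stride_tricks import sliding_window_view
    
    # Each row of v is one window of 10 consecutive entries
    v = sliding_window_view(dataset, 10)
    mn = v.min(1)
    mx = v.max(1)
    
    # Note: a constant window (mx == mn) gives nan here, not 0 as in the question's code
    result = (v[:, -1] - mn) / (mx - mn) * 2 - 1
    # result: array([-1. , -0.5,  0. ,  0.5,  1. ,  1. ,  0.6])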
    

    If using pandas:

    import pandas as pd
    
    s = pd.Series(dataset)
    t = s.rolling(10)
    mn = t.min()
    mx = t.max()
    # dropna() removes the first 9 rows, where the window is still incomplete
    ((s - mn) / (mx - mn) * 2 - 1).dropna()
    
    9    -1.0
    10   -0.5
    11    0.0
    12    0.5
    13    1.0
    14    1.0
    15    0.6
    dtype: float64
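
Since the question asks for an algorithmic improvement rather than a faster library, one option the answer above does not cover is the monotonic deque technique for sliding-window minimum and maximum. The sketch below is only an illustration under a few assumptions: the names `rolling_standardize`, `window`, `min_q`, and `max_q` are made up for this example, and a constant window is mapped to 0 as in the question's code. Each index is pushed and popped at most once per deque, so the whole pass is O(n) instead of O(n * window).

```python
from collections import deque


def rolling_standardize(data, window=10):
    """Scale each entry to [-1, 1] relative to the min/max of its trailing window."""
    min_q = deque()  # indices whose values increase front to back; front is the window min
    max_q = deque()  # indices whose values decrease front to back; front is the window max
    results = []
    for i, x in enumerate(data):
        # Discard older candidates that the new entry makes irrelevant.
        while min_q and data[min_q[-1]] >= x:
            min_q.pop()
        min_q.append(i)
        while max_q and data[max_q[-1]] <= x:
            max_q.pop()
        max_q.append(i)

        # Discard the front index if it slid out of the window [start, i].
        start = i - window + 1
        if min_q[0] < start:
            min_q.popleft()
        if max_q[0] < start:
            max_q.popleft()

        # Emit a result only once a full window is available.
        if start >= 0:
            lo, hi = data[min_q[0]], data[max_q[0]]
            results.append(0 if lo == hi else (x - lo) / (hi - lo) * 2 - 1)
    return results


dataset = [2, 2, 2, 2, 2, 2, 2, 2, 2, -2, -1, 0, 1, 2, 3, 2]
print(rolling_standardize(dataset))
```

For the sample dataset this should print the same seven values as the loop in the question. With a window of only 10 the gain over the naive loop is modest; the technique pays off mainly as the window grows, because the cost per entry no longer depends on the window size.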