Search code examples
pythonarraysnumpysignal-processingwindowing

Fast conditional overlapping windowing (framing) of numpy array


I have a huge list of numpy arrays (1 dimensional), which are time series for different events. Each point has a label, and I want to window the numpy arrays based on its label. The labels I have is 0, 1, and 2. Each window has a fixed size M.

The label of each window will be the biggest label available in the window. So if a window consists of both 0 an 1 labeled datapoints, the label will be 1 for the whole window.

But the problem is that, the windowing is not label agnostic. Because of class imbalance, I want to only do overlapped windowing in case of labels 1 and 2.

So far I have written this code:

# conditional framing
data = []
start_cursor = 0
while start_cursor < arr.size:
  end_cursor = start_cursor + window_size
  data.append(
    {
      "frame": arr[start_cursor:end_cursor],
      "label": y[start_cursor:end_cursor].max(),
    }
  )
  start_cursor = end_cursor
  if np.any(y[start_cursor, end_cursor] != 0):
    start_cursor = start_cursor - overlap_size        

But this is clearly too verbose and just plain inefficient, especially because I will call this while loop on my huge list of separate arrays.

EDIT: to explain the problem more. Imagine you are going to window a signal with fixed length M. If there only exists 0 label points in the window, there will be no overlap between adjacent windows. But if there exists labels 1 and 2, there will be an overlap between two signals with percentage p%.


Solution

  • I think this does what you are asking to do. The visualization for checking isn't great, but it helps you see how the windowing works. Hopefully I understood your question right and this is what you are trying to do. Anytime there is a 1 or 2 in the time series (rather than a 0) the window steps forward some fraction of the full window length (here 50%). windowing approach

    To examine how to do this, start with a sample time series:

    import matplotlib.pylab as plt
    import numpy as np
    
    N = 5000 # time series length
    
    # create some sort of data set to work with
    x = np.zeros(N)
    # add a few 1s and 2s to the list (though really they are the same for the windowing)
    y = np.random.random(N)
    x[y < 0.01] = 1
    x[y < 0.005] = 2
    
    # assign a window length
    M = 50 # window length
    overlap = 0.5 # assume 50% overlap
    M_overlap = int(M * (1-overlap))
    

    My approach is to sum the window of interest over your time series. If the sum ==0, there is no overlap between windows and if it is >0 then there is overlap. The question, then, becomes how should we calculate these sums efficiently? I compare two approaches. The first is simply to walk through the time series and the second is to use convolution (which is much faster). For the first one, I also explore different ways of assessing window size after summation.

    Summation (slow version)

    def window_sum1():
        # start of windows in list windows
        windows = [0,]
        while windows[-1] + M < N:
            check = sum(x[windows[-1]:windows[-1]+M]) == 0
            windows.append(windows[-1] + M_overlap + (M - M_overlap) * check)
            if windows[-1] + M > N:
                windows.pop()
                break
        # plotting stuff for checking
        return(windows)
    Niter = 10**4
    print(timeit.timeit(window_sum1, number = Niter))
    # 29.201083058
    

    So this approach went through 10,000 time series of length 5000 in about 30 seconds. But the line windows.append(windows[-1] + M_overlap + (M - M_overlap) * check) can be streamlined in an if statement.

    Summation (fast version, 33% faster than slow version)

    def window_sum2():
        # start of windows in list windows
        windows = [0,]
        while windows[-1] + M < N:
            check = sum(x[windows[-1]:windows[-1]+M]) == 0
            if check:
                windows.append(windows[-1] + M)
            else:
                windows.append(windows[-1] + M_overlap)
            if windows[-1] + M > N:
                windows.pop()
                break
        # plotting stuff for checking
        return(windows)
    print(timeit.timeit(window_sum2, number = Niter))
    # 20.456240447000003
    

    We see a 1/3 reduction in time with the if statement.

    Convolution (85% faster than fast summation)

    We can use signal processing to get a lot faster, by convolving the time series with the window of interest using numpy.convolve. (Disclaimer: I got the idea from the accepted answer to this question.) Of course, it also makes sense to adopt the faster window size assessment from above.

    def window_conv():
        a = np.convolve(x,np.ones(M,dtype=int),'valid')
        windows = [0,]
        while windows[-1] + M < N:
            if a[windows[-1]]:
                windows.append(windows[-1] + M_overlap)
            else:
                windows.append(windows[-1] + M)
            if windows[-1] + M > N:
                windows.pop()
                break
        return(windows)
    print(timeit.timeit(window_conv, number = Niter))
    #3.3695770570000008
    

    Sliding window

    The last thing I will add is that, as shown in one of the comments of this question, as of numpy 1.20 there is a function called sliding_window_view. I still have numpy 1.19 running and was not able to test it to see if it's faster than convolution.