Tags: python, arrays, numpy, scipy, stride

How to extract overlapping sub-arrays with a window size and flatten them


I am trying to get better at using NumPy functions and methods to make my Python programs run faster.

I want to do the following:

I create an array 'a' as:

a=np.random.randint(-10,11,10000).reshape(-1,10) 

a.shape: (1000,10)

I create another array that takes only the first two columns of array 'a':

b=a[:,0:2] 

b.shape: (1000,2)

Now I want to create an array 'c' with 990 rows, each containing a flattened slice of 10 consecutive rows of array 'b'. The first row of 'c' has 20 columns holding the flattened rows 0 to 9 of 'b' (that is, b[0:10]). The next row of 'c' holds the flattened rows 1 to 10 of 'b' (b[1:11]), and so on.

I can do this with a for loop, but I want to know if there is a much faster way to do it using NumPy functions and methods such as strides, or something else.

Thanks for your time and your help.


Solution

  • This loops over shifts rather than rows (loop of size 10):

    N = 10
    c = np.hstack([b[i:i-N] for i in range(N)])  
    

    Explanation: b[i:i-N] selects b's rows from i up to, but not including, m-(N-i), where m is the number of rows in b. np.hstack then stacks those sub-arrays horizontally (b[0:m-10], b[1:m-9], b[2:m-8], ..., b[9:m-1]), so row k of c is the flattened window b[k:k+10], as the question describes.

    c.shape: (990, 20)

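    Note that 1000 rows contain 991 windows of length 10, so you may be looking for a shape of (991, 20) if you want to include all windows; the expression above leaves out the last one, b[990:1000].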
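To check the shift-based approach against a plain loop, here is a minimal sketch. The names c_loop, c_all and c_990 are illustrative and not part of the original answer, and the slice b[i:i + m - N + 1] is one possible way to keep the final window, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(-10, 11, size=(1000, 10))
b = a[:, 0:2]
N = 10                      # window length (rows per window)
m = b.shape[0]              # number of rows in b

# Loop baseline: one flattened window per iteration, all m - N + 1 windows.
c_loop = np.array([b[k:k + N].ravel() for k in range(m - N + 1)])

# Shift-based version that keeps the last window as well.
# Slice i covers rows i .. i + (m - N), so every slice has m - N + 1 rows.
c_all = np.hstack([b[i:i + m - N + 1] for i in range(N)])

# The expression from the answer, which drops the final window.
c_990 = np.hstack([b[i:i - N] for i in range(N)])

print(c_loop.shape, c_all.shape, c_990.shape)
```

Both shift-based variants loop only N times, regardless of how many rows b has, which is where the speed-up over the row-wise loop comes from.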

    You can also use strides, but if you plan to do operations on the result, I would advise caution: a strided view shares memory with b and its windows overlap each other, so in-place writes can have surprising effects. Here is a strides-based solution if you insist:

    from skimage.util.shape import view_as_windows
    c = view_as_windows(b, (10,2)).reshape(-1, 20)
    

    c.shape: (991, 20)

    If you don't want the last row, simply remove it with c[:-1].
    A similar solution is possible with NumPy's np.lib.stride_tricks.as_strided function (the two work similarly, though I am not sure about their internals).
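For readers who prefer to stay within NumPy, the sketch below shows two ways to build the same windows: sliding_window_view (available in NumPy 1.20 and later) and as_strided. The test array and the names c_swv and c_as are assumptions made for illustration; the as_strided call reads the strides from b itself, so it should also work when b is a non-contiguous column slice, as in the question.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

a = np.arange(10000).reshape(1000, 10)
b = a[:, 0:2]               # non-contiguous view, as in the question
N = 10
m, k = b.shape

# Option 1: sliding_window_view returns a read-only view of shape (991, 1, 10, 2).
w = sliding_window_view(b, (N, k))
c_swv = w.reshape(-1, N * k)        # reshape has to copy here, giving (991, 20)

# Option 2: as_strided with explicit shape and strides.
# Moving to the next window and moving to the next row inside a window
# both advance by one row of b, hence the repeated s0.
s0, s1 = b.strides
w2 = as_strided(b, shape=(m - N + 1, N, k), strides=(s0, s0, s1), writeable=False)
c_as = w2.reshape(-1, N * k)

print(c_swv.shape, c_as.shape)
```

Because the final reshape produces a copy, c_swv and c_as can be modified without touching b. The intermediate views w and w2 are the objects that share memory, which is why as_strided is called with writeable=False here.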

    UPDATE: if you want to find unique values and their frequencies in each row of c you can do:

    unique_values = []  # one array of distinct values per row of c
    unique_counts = []  # matching occurrence counts, in the same order
    for row in c:
        unique, unique_c = np.unique(row, return_counts=True)
        unique_values.append(unique)
        unique_counts.append(unique_c)
    

    Note that NumPy arrays have to be rectangular, meaning every row must have the same number of elements. Since different rows of c can have different numbers of unique values, you cannot store the unique values of each row in a regular 2-D NumPy array (an alternative would be a structured NumPy array). A solution, therefore, is to build a list (or array) of arrays, one per row of c: unique_values is a list of arrays of unique values, and unique_counts holds their frequencies in the same order.
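If the range of possible values is known in advance, a rectangular table of counts becomes possible after all. The sketch below is a different technique from the loop above, not the answer's method: it assumes the values lie between -10 and 10 (the range used in the question) and counts every candidate value in every row with a single np.bincount call. The names lo, hi, values and counts are illustrative, and c is random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)
c = rng.integers(-10, 11, size=(991, 20))   # stand-in for the windowed array

lo, hi = -10, 10            # assumed known value range, taken from the question
n_vals = hi - lo + 1        # 21 candidate values
n_rows = c.shape[0]

# Shift values to 0 .. n_vals - 1, then offset each row into its own block
# of n_vals bins so that one bincount call handles all rows at once.
offsets = np.arange(n_rows)[:, None] * n_vals
flat = (c - lo + offsets).ravel()
counts = np.bincount(flat, minlength=n_rows * n_vals).reshape(n_rows, n_vals)

values = np.arange(lo, hi + 1)   # counts[r, j] is how often values[j] occurs in row r

# Recover the unique values and frequencies of row 0 from the table.
mask = counts[0] > 0
print(values[mask], counts[0][mask])
```

Values that do not occur in a row simply get a count of zero, which is what keeps the table rectangular.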