python numpy statistics data-processing

Python 3, Numpy: Split data into blocks of fixed length and calculate statistics for each block


Quick Solution

If you just want to split a numpy array or Python list into arrays or lists of fixed length, do this:

l = 10 # the fixed length of each output block
# a slice excludes its end index, so the end is l*(i+1), not l*(i+1)-1
output = [input[l*i:l*(i+1)] for i in range(0, len(input) // l)]

If the length of the input is not evenly divisible by l but you want to include the final (shorter) block in the output, do the following:

l = 10 # the fixed length of each output block
# ceiling division counts the final partial block; slicing past the end is safe
output = [input[l*i:l*(i+1)] for i in range(0, (len(input) + l - 1) // l)]
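
The two snippets above can also be folded into one helper by stepping the start index by the block length. This is only a minimal sketch: the name `split_fixed` and the `keep_remainder` flag are hypothetical and not part of Python or numpy.

```python
import numpy as np

def split_fixed(data, length, keep_remainder=True):
    """Split a list or 1-D array into consecutive blocks of `length` items."""
    # Stop before the trailing partial block unless the caller wants it.
    stop = len(data) if keep_remainder else len(data) - len(data) % length
    return [data[start:start + length] for start in range(0, stop, length)]

blocks = split_fixed(list(range(1, 11)), 3)
print(blocks)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]

full_only = split_fixed(np.arange(1, 11), 3, keep_remainder=False)
print(len(full_only))  # 3
```

Stepping `range` by the block length means the length appears only once in the slice arithmetic, and Python slicing clips the final block at the end of the data.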

Full Question

I am trying to calculate some statistics for some data. Example statistics include the mean, standard deviation, minimum and maximum.

The data is formatted as a python numpy array. Here is a simple example:

import numpy

data_in = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
data_array = numpy.array(data_in)

Here the example array has length 10, but in practice the input has on the order of 1 million elements (not an exact round number), and the statistics should be calculated over blocks of perhaps 10k elements.

Here is how I have attempted this. The example below is just shown for the mean statistic.

mean_out = [numpy.mean(data_array[2*i:2*(i+1)]) for i in range(0, len(data_array) // 2)]

This doesn't seem like a particularly elegant solution. The "block length" here is 2, and this appears in 3 places in the above expression.

This can be written in a more general way using bl for the block length.

mean_out = [numpy.mean(data_array[bl*i:bl*(i+1)]) for i in range(0, len(data_array) // bl)]

In addition to this, the above does not work when the input data length is not evenly divisible by the block length. For example, changing the block length to 3 results in an output with length 3, because 10 // 3 = 3.

Since 3 * 3 = 9, the final element is missing from the calculation.

This can be "fixed" by using the following expression:

mean_out = [numpy.mean(data_array[bl*i:bl*(i+1)]) for i in range(0, (len(data_array) + bl - 1) // bl)]

But again, this isn't particularly elegant.

Is there an inbuilt python or numpy function to calculate these statistics by splitting an input array into fixed length blocks? Or alternatively is there a better way to do this calculation which I am not aware of?


Solution

  • Numpy has array_split to split an array into blocks. Note that its second argument is the number of blocks, not the block length. To calculate the mean for each block you can use map:

    import numpy as np

    data_arrays = np.array_split(data_array, len(data_array) // 2)
    print(data_arrays) # [array([1, 2]), array([3, 4]), array([5, 6]), array([7, 8]), array([ 9, 10])]
    print(list(map(np.mean, data_arrays))) # [1.5, 3.5, 5.5, 7.5, 9.5]
    
    data_arrays = np.array_split(data_array, len(data_array) // 3)
    print(data_arrays) # [array([1, 2, 3, 4]), array([5, 6, 7]), array([ 8,  9, 10])]
    print(list(map(np.mean, data_arrays))) # [2.5, 6.0, 9.0]
    

    Note: map returns an iterator. To get the output as a numpy array, like the input, the following is required:

    np.fromiter(map(np.mean, data_arrays), dtype=float)
    

    The same thing can be accomplished by converting to a list, and then to a numpy array, as shown above.
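
Because array_split takes a number of sections, the blocks above are not guaranteed to have the requested length (the second example gives blocks of 4, 3 and 3 elements). If a strictly fixed block length is needed, one option is to reshape the full blocks into a 2-D array and reduce along each row, handling any leftover elements separately. The following is a sketch under that assumption; the helper name `block_stats` is hypothetical and not a numpy function.

```python
import numpy as np

def block_stats(data, bl):
    """Return per-block mean, std, min and max for blocks of length `bl`.

    A shorter final block, if any, contributes the last entry of each array.
    """
    data = np.asarray(data, dtype=float)
    n_full = len(data) // bl
    # Arrange the full blocks as rows of a 2-D array, one block per row.
    full = data[:n_full * bl].reshape(n_full, bl)
    stats = {
        "mean": full.mean(axis=1),
        "std": full.std(axis=1),
        "min": full.min(axis=1),
        "max": full.max(axis=1),
    }
    tail = data[n_full * bl:]
    if tail.size:
        # Append the statistics of the leftover elements.
        stats = {name: np.append(values, getattr(tail, name)())
                 for name, values in stats.items()}
    return stats

data_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
result = block_stats(data_array, 3)
print(result["mean"].tolist())  # [2.0, 5.0, 8.0, 10.0]
print(result["max"].tolist())   # [3.0, 6.0, 9.0, 10.0]

# To get the fixed-length blocks themselves, pass split indices
# (3, 6, 9 here) instead of a section count.
blocks = np.split(data_array, np.arange(3, len(data_array), 3))
print([b.tolist() for b in blocks])  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
```

The reshape route computes all full blocks in vectorised calls, which is likely to matter for inputs of around a million elements; the list of arrays from np.split is more convenient when each block needs custom processing.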