Tags: arrays, python-3.x, numpy, dask, awkward-array

Python collection of different-sized arrays (jagged arrays): Dask?


I have multiple 1-D numpy arrays of different sizes representing audio data. Since they're different sizes (e.g. (8200,), (13246,), (61581,)), I cannot stack them into one array with numpy. The size differences are too big to make zero-padding practical.

I can keep them in a list or dictionary and iterate over them with for-loops to do calculations, but I would prefer a numpy-style approach: calling a numpy function on the variable without having to write a for-loop. Something like:

import numpy as np

np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
np_mix = irregular_stack(np0, np1)  # hypothetical function, not part of numpy
np.sum(np_mix)
# desired output: [-0.7, 0.09999999999999998]
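
For per-array sums specifically, plain numpy can get close to this by concatenating the arrays and reducing over segments with np.add.reduceat. The following is only a minimal sketch: jagged_sum is a hypothetical helper name (not a numpy function), it assumes every array is non-empty, and it still loops once over the list to read the lengths.

```python
import numpy as np


def jagged_sum(arrays):
    """Return the sum of each 1-D array in `arrays`.

    Hypothetical helper: assumes every array is non-empty, because
    np.add.reduceat does not return 0 for an empty segment.
    """
    # One pass over the list to collect lengths (not over the samples).
    lengths = np.array([len(a) for a in arrays])
    flat = np.concatenate(arrays)
    # Start offset of each array inside `flat`: 0, len0, len0 + len1, ...
    offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    # Sum flat[offsets[i]:offsets[i + 1]] for every segment in one call.
    return np.add.reduceat(flat, offsets)


np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
sums = jagged_sum([np0, np1])  # approximately [-0.7, 0.1]
```

This covers only reductions that exist as ufuncs (sum, min, max via np.minimum.reduceat and so on); it is not a general jagged-array type.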

[Image: dask-arrays]

Looking at this Dask picture, I was wondering if I can do what I want with Dask.

My attempt so far is this:

import numpy as np
import dask.array as da

np0 = np.array([.2, -.4, -.5])
arr0 = da.from_array(np0, chunks=(3,))
np1 = np.array([-.8, .9])
arr1 = da.from_array(np1, chunks=(2,))

# try to stack them as two rows of one block array
data = [[arr0],
        [arr1]]

x = da.block(data)
x.compute()

# output: ValueError: ('Shapes do not align: %s', [(1, 3), (1, 2)])
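
The error appears to come from the fact that a Dask array, like a numpy array, is rectangular: da.block can only join blocks whose shapes line up, and (1, 3) and (1, 2) do not. A minimal sketch of one workaround follows, assuming it is acceptable to keep one Dask array per recording and only batch the computation. It still builds Python lists, so it does not produce a single jagged array.

```python
import numpy as np
import dask
import dask.array as da

np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])

# One independent Dask array per recording; no common shape is required.
rows = [da.from_array(a, chunks=a.shape) for a in (np0, np1)]

# Each row.sum() is lazy; nothing is computed yet.
lazy_sums = [row.sum() for row in rows]

# Evaluate all sums in a single dask.compute call and collect the results.
sums = np.array(dask.compute(*lazy_sums))  # approximately [-0.7, 0.1]
```

Whether this beats a plain for-loop on a single machine depends on the array sizes and the work done per array; for small arrays the scheduling overhead may well dominate.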

Questions

  1. Am I misunderstanding how Dask can be used?
  2. If it's possible, how do I do my np.sum() example?
  3. If it's possible, is it actually faster than a for-loop on a single high-end PC?

Solution

  • I found the library awkward-array (https://github.com/scikit-hep/awkward-array), which supports arrays of different lengths and can do what I asked for:

    import numpy as np
    import awkward
    
    np0 = np.array([.2, -.4, -.5])
    np1 = np.array([-.8, .9])
    varlen = awkward.fromiter([np0, np1])
    # <JaggedArray [[0.2 -0.4 -0.5] [-0.8 0.9]] at 0x7f01a743e790>
    
    varlen.sum()
    # output: array([-0.7,  0.1])
    

    The library describes itself as: "Manipulate arrays of complex data structures as easily as Numpy."

    So far, it seems to satisfy everything I need.