Search code examples
pythonnumpydaskdask-delayed

efficiently create dask.array from a dask.Series of lists


What is the most efficient way to create a dask.array from a dask.Series of list? The series consists of 5 million lists 300 of elements. It is currently divide into 500 partitions. Currently I am trying:

pt = [delayed(np.array)(y)
      for y in
      [delayed(list)(x)
       for x in series.to_delayed()]]
da = delayed(dask.array.concatenate)(pt, axis=1)
da = dask.array.from_delayed(da, (vec.size.compute(), 300), dtype=float)

The idea is to convert each partition into a numpy array and stitch those together into a dask.array. This code is taking forever to run though. A numpy array can be built from this data quite quickly from this data sequentially as long as there is enough RAM.


Solution

  • I think that you are on the right track using dask.delayed. However calling list on the series is probably not ideal. I would create a function that converts one of your series into a numpy array and then go through delayed with that.

    def convert_series_to_array(pandas_series):  # make this as fast as you can
        ...
        return numpy_array
    
    L = dask_series.to_delayed()
    L = [delayed(convert_series_to_array)(x) for x in L]
    arrays = [da.from_delayed(x, shape=(np.nan, 300), dtype=...) for x in L]
    x = da.concatenate(arrays, axis=0)
    

    Also, regarding this line:

    da = delayed(dask.array.concatenate)(pt, axis=1)
    

    You should never call delayed on a dask function. They are already lazy.