Tags: python, pandas, python-multiprocessing, python-multithreading, dask

How to use dask to populate DataFrame in parallelized task?


I would like to use dask to parallelize a number-crunching task.

This task utilizes only one of the cores in my computer.

As a result of that task I would like to add an entry to a DataFrame via shared_df.loc[len(shared_df)] = [x, 'y']. This DataFrame should be populated by all four parallel workers / threads on my computer.

How do I need to set up dask to do this?


Solution

  • The right way to do something like this, in rough outline:

    • make a function that, for a given argument, returns a data-frame of some part of the total data

    • wrap this function in dask.delayed, make a list of such calls, one for each input argument, and build a dask DataFrame with dd.from_delayed (see the sketch after this list)

    • if you really need the index to be sorted, or need the data partitioned along different lines than the chunking you applied in the previous step, you may want to call set_index

    Please read the docstrings and examples for each of these steps!
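    A minimal sketch of those steps, assuming a toy computation. The function name make_partial_frame and the squaring step are placeholders standing in for your real number-crunching, not part of the question:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    def make_partial_frame(arg):
        # Each call returns an ordinary pandas DataFrame holding one chunk
        # of the total result (here: one row per input argument).
        x = arg ** 2  # stand-in for the real computation
        return pd.DataFrame({'value': [x], 'label': ['y']})

    # One delayed call per input argument; nothing is executed yet.
    delayed_parts = [dask.delayed(make_partial_frame)(arg) for arg in range(4)]

    # meta describes the column names/dtypes so dask can build the
    # dask DataFrame without running any of the delayed calls.
    meta = pd.DataFrame({'value': pd.Series(dtype='int64'),
                         'label': pd.Series(dtype='object')})
    ddf = dd.from_delayed(delayed_parts, meta=meta)

    # Optional: repartition/sort on a column if you need a proper index.
    # ddf = ddf.set_index('value')

    # compute() runs the delayed calls in parallel and concatenates the pieces.
    result = ddf.compute()
    print(result)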