dask, dask-distributed, dask-delayed

dask.distributed not utilising the cluster


I'm not able to process this block using the distributed cluster.

import pandas as pd
from dask import dataframe as dd 
import dask

df = pd.DataFrame({'reid_encod': [[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10]]})
dask_df = dd.from_pandas(df, npartitions=3)
save_val = []
def add(dask_df):
    for _, outer_row in dask_df.iterrows():
        for _, inner_row in dask_df.iterrows():
            for base_encod in outer_row['reid_encod']:
                for compare_encod in inner_row['reid_encod']:
                    val = base_encod + compare_encod
                    save_val.append(val)
    return save_val

from dask.distributed import Client

client = Client(...)
dask_compute = dask.delayed(add)(dask_df)
dask_compute.compute()

Also, I have a few queries:

  1. Does dask.delayed use the available cluster to do the computation?

  2. Can I parallelize the for-loop iteration over this pandas DF using delayed, and use multiple computers in the cluster to do the computation?

  3. Does dask.distributed work on a pandas dataframe?

  4. Can we use dask.delayed with dask.distributed?

  5. If the above programming approach is wrong, can you guide me on whether to choose delayed or a dask DF for the above scenario?


Solution

  • For the record, here are some answers, although I would also note my earlier general points about this question.

    Does dask.delayed use the available cluster to do the computation?

    If you have created a client to a distributed cluster, dask will use it for computation unless you specify otherwise.
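    A minimal sketch of that default behaviour, assuming a LocalCluster stands in for your real cluster: once the Client exists, calling .compute() on any delayed object runs it on the cluster's workers.

    import dask
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(n_workers=2)   # stand-in for your real cluster
    client = Client(cluster)              # becomes the default scheduler for dask

    @dask.delayed
    def square(x):
        return x * x

    total = dask.delayed(sum)([square(i) for i in range(10)])
    print(total.compute())                # 285, executed on the workers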

    Can I parallelize the for-loop iteration over this pandas DF using delayed, and use multiple computers in the cluster to do the computation?

    Yes, in general you can use delayed with pandas dataframes for parallelism if you wish. However, your dataframe is very small, so it is not obvious in this case how best to split the work; it depends on what you really want to achieve.
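    One possible sketch, assuming the goal really is the pairwise sums the loops compute: split the pandas frame into chunks and hand each chunk-versus-whole comparison to its own delayed task, so the chunks can run on different workers.

    import pandas as pd
    import dask

    def pairwise_sums(chunk, whole):
        # the same nested loops as `add`, but for one chunk against the full frame
        out = []
        for _, outer_row in chunk.iterrows():
            for _, inner_row in whole.iterrows():
                for a in outer_row['reid_encod']:
                    for b in inner_row['reid_encod']:
                        out.append(a + b)
        return out

    df = pd.DataFrame({'reid_encod': [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]] * 6})
    chunks = [df.iloc[i:i + 2] for i in range(0, len(df), 2)]     # three 2-row chunks
    tasks = [dask.delayed(pairwise_sums)(chunk, df) for chunk in chunks]
    parts = dask.compute(*tasks)                                  # uses the cluster if a Client exists
    save_val = [v for part in parts for v in part]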

    Does dask.distributed work on a pandas dataframe?

    Yes, you can do anything that python can do with distributed, since it is just python processes executing code. Whether it brings you the performance you are after is a separate question.
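    A small sketch of that, assuming a running Client: distributed simply ships plain pandas objects and python functions to the workers, for example via scatter and submit.

    import pandas as pd
    from dask.distributed import Client

    client = Client()                         # or Client('scheduler-address:8786')
    df = pd.DataFrame({'reid_encod': [[1, 2, 3, 4, 5]] * 6})

    remote_df = client.scatter(df)            # put the pandas frame on a worker
    future = client.submit(lambda d: d.shape[0], remote_df)   # any plain pandas/python function
    print(future.result())                    # 6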

    Can we use dask.delayed with dask.distributed?

    Yes, distributed can execute anything that dask in general can, including delayed functions/objects.
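    For instance, assuming client is already connected as above, a delayed object can be run either with .compute(), which picks up the client automatically, or with client.compute(), which returns a Future.

    import dask
    from dask.distributed import Client

    client = Client()
    lazy = dask.delayed(sum)([1, 2, 3])

    print(lazy.compute())            # blocks; runs on the cluster because a Client exists
    future = client.compute(lazy)    # non-blocking; returns a Future
    print(future.result())           # 6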

    If the above programming approach is wrong, can you guide me on whether to choose delayed or a dask DF for the above scenario?

    Not easily; it is not clear to me that this is a dataframe operation at all. It seems more like an array computation, but, again, I note that your function does not actually return anything particularly useful.
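    To illustrate the array point, a sketch (under the assumption that the flat list of pairwise sums really is what you want): stack the encodings into a dask array and use a broadcasted outer sum instead of four nested python loops.

    import numpy as np
    import pandas as pd
    import dask.array as da

    df = pd.DataFrame({'reid_encod': [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]] * 6})
    encod = da.from_array(np.stack(df['reid_encod'].to_numpy()), chunks=(3, 10))

    # encod[i, a] + encod[j, b] for every i, j, a, b: the same values the loops produce
    pairwise = encod[:, None, :, None] + encod[None, :, None, :]
    save_val = pairwise.reshape(-1).compute()    # runs on the cluster if a Client exists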

    In the tutorial: passing pandas dataframes to delayed; the same applies with the dataframe API.