I would like to run computations on a partitioned dataframe but not collect the results at the end. In the simplified example below, I'm doing a simple shuffle that collects rows with the same shuffle_value into the same partition. I then want to output something from each partition independently. The output of ddf.compute() collects the full Pandas dataframe, which I don't want to do because A) I don't need the full dataframe in one place and B) it potentially won't fit in memory anyway. In small test cases something like the code below works for me, but it won't scale.
How do I trigger computation on each partition but not send any results to the calling process?
ddf = dd.read_parquet(...)
ddf["shuffle_value"] = ddf.map_partitions(calc_shuffle_value,
meta=("shuffle_value", "int64"))
ddf = ddf.shuffle(ddf.shuffle_value, ignore_index=True, npartitions=N)
ddf = ddf.map_partitions(output_func, meta=("unused_value", "int64"))
df = ddf.compute()  # collects results back to the calling process
Thanks!
In general, assuming that output_func mostly creates output by side-effects and doesn't return anything useful, there is no problem with executing your process with .compute(), as the dataframe returned will be tiny.
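As a concrete sketch of that pattern (the output path and the return value here are made up for illustration, not from your code): output_func does its real work as a side-effect and returns a one-row marker per partition, so the final compute() only gathers a handful of integers.

import uuid

import pandas as pd

def output_func(part: pd.DataFrame) -> pd.Series:
    # Side-effect: write this partition to its own file
    # (the "out/" path scheme is just an example).
    part.to_parquet(f"out/part-{uuid.uuid4().hex}.parquet")
    # Tiny return value, so compute() has almost nothing to gather.
    return pd.Series([len(part)], dtype="int64", name="unused_value")

out = ddf.map_partitions(output_func, meta=("unused_value", "int64"))
out.compute()  # returns one int64 per partition, not the data itself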
Assuming you are using distributed, other options include .persist()-ing the final dataframe, which will execute the process (asynchronously) and store the not-useful return values of each partition on the cluster, but not gather them. You can then del the reference to remove those data without collecting them. The client.submit API would do a very similar thing.
from dask.distributed import wait

df_out = ddf.persist()  # starts executing on the cluster, returns immediately
wait(df_out)            # block until finished - or watch the dashboard / progress bar
del df_out              # release the stored results without collecting them
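For the futures route, here is a sketch of the equivalent (assuming a running distributed Client and the same output_func as above; it uses client.compute on per-partition delayed objects, which plays the same role as client.submit here):

import dask
from dask.distributed import Client, wait

client = Client()  # connect to (or start) a cluster; adjust for your setup

# One Delayed object per shuffled partition.
parts = ddf.to_delayed()

# Run the side-effecting function on every partition; the futures
# (and their tiny return values) stay on the cluster.
futures = client.compute([dask.delayed(output_func)(p) for p in parts])

wait(futures)  # block until all partitions have been processed
del futures    # release the stored return values without gathering them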