Currently the integration between Spark structures and Dask seems cumbersome when dealing with complicated nested structures. Specifically, dumping a Spark DataFrame with a nested structure so that it can be read by Dask does not yet seem very reliable, although parquet loading is part of a large ongoing effort (fastparquet, pyarrow).
So my follow-up question: let's assume that I can live with doing a few transformations in Spark and transforming the DataFrame into an RDD that contains custom class objects. Is there a way to reliably dump the data of a Spark RDD with custom class objects and read it into a Dask collection? Obviously you can collect the RDD into a Python list, pickle it and then read it as a normal data structure, but that removes the opportunity to load larger-than-memory datasets. Could something like the Spark pickling be used by Dask to load a distributed pickle?
I solved this by doing the following:
Starting from a Spark RDD with a list of custom objects as Row values, I created a version of the RDD in which I serialised the objects to strings using cPickle.dumps. I then converted this RDD into a simple DataFrame with string columns and wrote it to parquet.
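A minimal sketch of the Spark side of this, assuming Python 3 (pickle rather than cPickle) and a hypothetical custom class MyObject. For the pickle round trip to work, the class has to live in a module that both the Spark workers and, later, the Dask workers can import; here I call that module my_objects.py. Note that on Python 3 pickle.dumps returns bytes, so the column ends up binary rather than string, but the idea is the same:

```python
# my_objects.py -- hypothetical module holding the custom class.  It must be
# importable on the Spark workers (e.g. shipped via --py-files) and on the
# Dask workers, otherwise unpickling fails with an attribute lookup error.
class MyObject(object):
    def __init__(self, value):
        self.value = value
```

```python
# Hypothetical Spark driver script: serialise each object into a single
# simple-typed column and write a flat parquet dataset that Dask can read.
import pickle  # cPickle in the original (Python 2) answer

from pyspark.sql import Row, SparkSession

from my_objects import MyObject

spark = SparkSession.builder.getOrCreate()

# An RDD of custom objects, as in the question.
rdd = spark.sparkContext.parallelize(range(1000)).map(lambda i: MyObject(i))

# pickle.dumps returns bytes on Python 3, so this becomes a binary column;
# on Python 2 cPickle.dumps returned a str, hence the "string columns"
# mentioned above.  Either way the parquet schema stays flat.
pickled = rdd.map(lambda obj: Row(payload=bytearray(pickle.dumps(obj))))

df = spark.createDataFrame(pickled)  # one-column DataFrame
df.write.mode("overwrite").parquet("/tmp/pickled_objects.parquet")
```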
Dask is able to read parquet files with this kind of simple structure. Finally, I deserialised the values with cPickle.loads to get the original objects back.
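And a sketch of the Dask side, under the same assumptions (the path, column name and my_objects module match the hypothetical Spark snippet above). Deserialising partition by partition keeps the larger-than-memory property, since only one partition is materialised at a time:

```python
import pickle

import dask.dataframe as dd

# Both fastparquet and pyarrow can handle this flat, single-column layout.
# The my_objects module must be importable on every Dask worker so that
# pickle.loads can reconstruct the MyObject instances.
ddf = dd.read_parquet("/tmp/pickled_objects.parquet")

# Deserialise lazily, one partition at a time.
objects = ddf["payload"].map_partitions(
    lambda part: part.map(pickle.loads),
    meta=("payload", object),
)

print(objects.head())  # triggers computation of the first partition only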