Currently the integration between Spark structures and Dask seems cumbersome when dealing with complicated nested structures. Specifically, dumping a Spark DataFrame with a nested structure so that it can be read by Dask does not yet seem very reliable, although parquet loading is part of a large ongoing effort (fastparquet, pyarrow).
So my follow-up question: let's assume that I can live with doing a few transformations in Spark and transforming the DataFrame into an RDD that contains custom class objects. Is there a way to reliably dump the data of a Spark RDD with custom class objects and read it into a Dask collection? Obviously you can collect the RDD into a Python list, pickle it and then read it as a normal data structure, but that removes the opportunity to load larger-than-memory datasets. Could something like the Spark pickling be used by Dask to load a distributed pickle?
I solved this by doing the following:
Starting from a Spark RDD with a list of custom objects as Row values, I created a version of the RDD in which I serialised the objects to strings using cPickle.dumps. I then converted this RDD into a simple DataFrame with string columns and wrote it to parquet.
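A minimal sketch of the Spark side of this, assuming Python 3 (pickle rather than cPickle) and a hypothetical custom class MyObject. For the pickle round trip to work, the class has to live in a module that both the Spark workers and, later, the Dask workers can import; here I call that module my_objects.py. Note that on Python 3 pickle.dumps returns bytes, so the column ends up binary rather than string, but the idea is the same:

```python
# my_objects.py -- hypothetical module holding the custom class.  It must be
# importable on the Spark workers (e.g. shipped via --py-files) and on the
# Dask workers, otherwise unpickling fails with an attribute lookup error.
class MyObject(object):
    def __init__(self, value):
        self.value = value
```

```python
# Hypothetical Spark driver script: serialise each object into a single
# simple-typed column and write a flat parquet dataset that Dask can read.
import pickle  # cPickle in the original (Python 2) answer

from pyspark.sql import Row, SparkSession

from my_objects import MyObject

spark = SparkSession.builder.getOrCreate()

# An RDD of custom objects, as in the question.
rdd = spark.sparkContext.parallelize(range(1000)).map(lambda i: MyObject(i))

# pickle.dumps returns bytes on Python 3, so this becomes a binary column;
# on Python 2 cPickle.dumps returned a str, hence the "string columns"
# mentioned above.  Either way the parquet schema stays flat.
pickled = rdd.map(lambda obj: Row(payload=bytearray(pickle.dumps(obj))))

df = spark.createDataFrame(pickled)  # one-column DataFrame
df.write.mode("overwrite").parquet("/tmp/pickled_objects.parquet")
```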
Dask is able to read parquet files with this kind of simple structure. Finally, I deserialised the values with cPickle.loads to get the original objects back.
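And a sketch of the Dask side, under the same assumptions (the path, column name and my_objects module match the hypothetical Spark snippet above). Deserialising partition by partition keeps the larger-than-memory property, since only one partition is materialised at a time:

```python
import pickle

import dask.dataframe as dd

# Both fastparquet and pyarrow can handle this flat, single-column layout.
# The my_objects module must be importable on every Dask worker so that
# pickle.loads can reconstruct the MyObject instances.
ddf = dd.read_parquet("/tmp/pickled_objects.parquet")

# Deserialise lazily, one partition at a time.
objects = ddf["payload"].map_partitions(
    lambda part: part.map(pickle.loads),
    meta=("payload", object),
)

print(objects.head())  # triggers computation of the first partition only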