I have a Dask dataframe, one column of which contains a numpy array of floats:
import dask.dataframe as dd
import pandas as pd
import numpy as np
df = dd.from_pandas(
    pd.DataFrame(
        {
            'id': range(1, 6),
            'vec': [np.array([1.0, 2.0, 3.0, 4.0, 5.0])] * 5
        }), npartitions=1)
df.compute()
   id                        vec
0   1  [1.0, 2.0, 3.0, 4.0, 5.0]
1   2  [1.0, 2.0, 3.0, 4.0, 5.0]
2   3  [1.0, 2.0, 3.0, 4.0, 5.0]
3   4  [1.0, 2.0, 3.0, 4.0, 5.0]
4   5  [1.0, 2.0, 3.0, 4.0, 5.0]
If I try writing this out as parquet, I get an error:
df.to_parquet('somefile')
....
Error converting column "vec" to bytes using encoding UTF8. Original error: bad argument type for built-in operation
I presume this is because the 'vec' column has type 'object', and so the parquet serializer tries to write it as a string. Is there some way to tell either the Dask DataFrame or the serializer that the column is an array of floats?
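For reference, checking the dtypes (a quick sketch against the df defined above) shows that 'vec' is indeed held as a generic object column:

df.dtypes
# id      int64
# vec    object
# dtype: object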
I have discovered that it is possible if the pyarrow engine is used instead of the default fastparquet engine. First install pyarrow:
pip install pyarrow
# or
conda install pyarrow
then:
df.to_parquet('somefile', engine='pyarrow')
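To sanity-check the round trip (a sketch; the exact Python type of each cell after reading back may depend on the pyarrow version), the data can be read back with the same engine:

df2 = dd.read_parquet('somefile', engine='pyarrow')
df2.compute()['vec'].iloc[0]
# expected something like: array([1., 2., 3., 4., 5.])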
The docs for fastparquet at https://github.com/dask/fastparquet/ say "only simple data-types and plain encoding are supported", so I guess that means no arrays.
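If staying on fastparquet is a requirement, one possible workaround (my own sketch, not from the fastparquet docs; explode_vec and somefile_flat are hypothetical names) is to expand the fixed-length vectors into plain float64 columns, which the "simple data-types" restriction does allow:

import numpy as np
import pandas as pd
import dask.dataframe as dd

def explode_vec(pdf: pd.DataFrame) -> pd.DataFrame:
    # Stack the per-row vectors into an (nrows, 5) float array and
    # turn each position into its own column.
    vec_cols = pd.DataFrame(
        np.stack(pdf['vec'].to_numpy()),
        index=pdf.index,
        columns=[f'vec_{i}' for i in range(5)])
    return pd.concat([pdf.drop(columns='vec'), vec_cols], axis=1)

# meta is given explicitly so Dask does not have to infer it by
# calling explode_vec on dummy data.
meta = {'id': 'int64', **{f'vec_{i}': 'float64' for i in range(5)}}
flat = df.map_partitions(explode_vec, meta=meta)
flat.to_parquet('somefile_flat', engine='fastparquet')  # simple float64 columns only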