This is a question related to this post.
I am experimenting with Dask and Parquet files. I loaded the New York parking violations data I downloaded here.
I read the data files, find the common columns, apply datatypes, and afterwards save everything as a parquet collection:
from dask import dataframe as dd
from dask.diagnostics import ProgressBar
import numpy as np
base_url = 'origin/nyc-parking-tickets/'
fy14 = dd.read_csv(base_url + '*2014_.csv')
fy15 = dd.read_csv(base_url + '*2015.csv')
fy16 = dd.read_csv(base_url + '*2016.csv')
fy17 = dd.read_csv(base_url + '*2017.csv')
data = [fy14, fy15, fy16, fy17]
col_set = [set(d.columns) for d in data]
common_columns = list(set.intersection(*col_set))
# Set proper column types
dtype_tuples = [(x, str) for x in common_columns]
dtypes = dict(dtype_tuples)
floats = ['Feet From Curb', 'Issuer Code', 'Issuer Precinct', 'Law Section', 'Vehicle Year', 'Violation Precinct']
ints32 = ['Street Code1', 'Street Code2', 'Street Code3', 'Summons Number']
ints16 = ['Violation Code']
for item in floats: dtypes[item] = np.float32
for item in ints32: dtypes[item] = np.int32
for item in ints16: dtypes[item] = np.int16
# Read Data
data = dd.read_csv(base_url + '*.csv', dtype=dtypes, usecols=common_columns) # usecols not in Dask documentation, but from pandas
# Write data as parquet
target_url = 'target/nyc-parking-tickets-pq/'
with ProgressBar():
    data.to_parquet(target_url)
When I attempt to reload the data
data2 = dd.read_parquet(target_url, engine='pyarrow')
I get a ValueError, namely that some of the partitions have a different schema. Looking at the output, I can see that the 'Violation Legal Code' column is interpreted as null in one partition, presumably because the data is too sparse for the sampling to infer a type.
In the post with the original question, two solutions are suggested: the first is to fill in dummy values, the other is to supply column types when loading the data. I would like to do the latter, but I am stuck.
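As I understand it, the first suggestion amounts to filling the sparse column with a placeholder before writing, roughly like this (the 'NONE' value is just an illustration):
# Sketch of the dummy-value workaround: fill the all-null column before writing,
# so the writer never has to infer a type from an empty partition
data['Violation Legal Code'] = data['Violation Legal Code'].fillna('NONE')
data.to_parquet(target_url)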
In the dd.read_csv method I can pass the dtype argument, for which I just enter the dtypes dictionary defined above. dd.read_parquet does not accept that keyword. The documentation seems to suggest that categories takes over that role, but even when passing categories=dtypes, I still get the same error.

How can I pass type specifications in dask.dataframe.read_parquet?
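For reference, the attempt with categories looks roughly like this (reusing the dtypes dictionary from above); it fails with the same ValueError:
# Attempt to pass the type mapping through the categories keyword --
# this still raises the same schema mismatch error as the plain read_parquet call
data2 = dd.read_parquet(target_url, engine='pyarrow', categories=dtypes)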
It seems the problem was with the parquet engine. When I changed the code to
data.to_parquet(target_url, engine='fastparquet')
and
data2 = dd.read_parquet(target_url, engine='fastparquet')
the writing and loading worked fine.
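Put together, the working round trip looks like this (it requires the fastparquet package to be installed):
# Write the collection with the fastparquet engine and read it back with the same engine
with ProgressBar():
    data.to_parquet(target_url, engine='fastparquet')
data2 = dd.read_parquet(target_url, engine='fastparquet')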