This is a question related to this post.
I am experimenting with Dask and Parquet files. I loaded the New York parking violations data I downloaded here.
I read the data files, find the common columns, apply datatypes, and afterwards save everything as a parquet collection:
from dask import dataframe as dd
from dask.diagnostics import ProgressBar
import numpy as np
base_url = 'origin/nyc-parking-tickets/'
fy14 = dd.read_csv(base_url + '*2014_.csv')
fy15 = dd.read_csv(base_url + '*2015.csv')
fy16 = dd.read_csv(base_url + '*2016.csv')
fy17 = dd.read_csv(base_url + '*2017.csv')
data = [fy14, fy15, fy16, fy17]
col_set = [set(d.columns) for d in data]
common_columns = list(set.intersection(*col_set))
# Set proper column types
dtype_tuples = [(x, str) for x in common_columns]
dtypes = dict(dtype_tuples)
floats = ['Feet From Curb', 'Issuer Code', 'Issuer Precinct', 'Law Section', 'Vehicle Year', 'Violation Precinct']
ints32 = ['Street Code1', 'Street Code2', 'Street Code3', 'Summons Number']
ints16 = ['Violation Code']
for item in floats: dtypes[item] = np.float32
for item in ints32: dtypes[item] = np.int32
for item in ints16: dtypes[item] = np.int16
# Read Data
data = dd.read_csv(base_url + '*.csv', dtype=dtypes, usecols=common_columns) # usecols not in Dask documentation, but from pandas
# Write data as parquet
target_url = 'target/nyc-parking-tickets-pq/'
with ProgressBar():
    data.to_parquet(target_url)
When I attempt to reload the data
data2 = dd.read_parquet(target_url, engine='pyarrow')
I get a ValueError, namely that some of the partitions have a different schema. Looking at the output, I can see that the 'Violation Legal Code' column is interpreted as null in one partition, presumably because the data is too sparse for the sampling to infer a type.
In the post with the original question, two solutions are suggested: the first is to fill in dummy values, the other is to supply column types when loading the data. I would like to do the latter, but I am stuck.
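As I understand it, the first suggestion amounts to filling the sparse column with a placeholder before writing, roughly like this (the 'NONE' value is just an illustration):
# Sketch of the dummy-value workaround: fill the all-null column before writing,
# so the writer never has to infer a type from an empty partition
data['Violation Legal Code'] = data['Violation Legal Code'].fillna('NONE')
data.to_parquet(target_url)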
In the dd.read_csv method I can pass the dtype argument, for which I just enter the dtypes dictionary defined above. dd.read_parquet does not accept that keyword. The documentation seems to suggest that categories takes over that role, but even when passing categories=dtypes, I still get the same error.

How can I pass type specifications in dask.dataframe.read_parquet?
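For reference, the attempt with categories looks roughly like this (reusing the dtypes dictionary from above); it fails with the same ValueError:
# Attempt to pass the type mapping through the categories keyword --
# this still raises the same schema mismatch error as the plain read_parquet call
data2 = dd.read_parquet(target_url, engine='pyarrow', categories=dtypes)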
It seems the problem was with the parquet engine. When I changed the code to
data.to_parquet(target_url, engine='fastparquet')
and
data2 = dd.read_parquet(target_url, engine='fastparquet')
the writing and loading worked fine.
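Put together, the working round trip looks like this (it requires the fastparquet package to be installed):
# Write the collection with the fastparquet engine and read it back with the same engine
with ProgressBar():
    data.to_parquet(target_url, engine='fastparquet')
data2 = dd.read_parquet(target_url, engine='fastparquet')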