python dask parquet pyarrow

Reading snappy parquet files on Windows causes python to crash


I'm unable to read snappy-compressed parquet files via pyarrow on Windows: the Python process crashes instead of raising an exception.

import dask.dataframe as dd
import numpy as np
import pandas as pd

# Build a small 15x4 frame of random integers and wrap it in a single-partition Dask DataFrame
df = pd.DataFrame(np.random.randint(0, 100, size=(15, 4)), columns=list("ABCD"))
dd_df = dd.from_pandas(df, npartitions=1)

# Write with snappy compression, then read the file back lazily
dd_df.to_parquet("my_df.snappy.parquet", engine="pyarrow", compression="snappy")
dd_df_copy = dd.read_parquet("my_df.snappy.parquet", engine="pyarrow")
dd_df_copy.compute()  # <--- This is where it crashes

I've replicated this problem in a clean Anaconda environment with Python 3.8. After creating the environment, I ran pip install "dask[complete]" and pip install pyarrow.

The error is:

Problem signature:
  Problem Event Name:   APPCRASH
  Application Name: python.exe
  Application Version:  3.8.3150.1013
  Application Timestamp:    5ed53446
  Fault Module Name:    arrow.dll
  Fault Module Version: 0.0.0.0
  Fault Module Timestamp:   5ebd3029
  Exception Code:   c000001d
  Exception Offset: 00000000007abfc7
  OS Version:   6.3.9600.2.0.0.16.7
  Locale ID:    1033
  Additional Information 1: d8e4
  Additional Information 2: d8e42c04b828d96accf490cd13472bea
  Additional Information 3: aebe
  Additional Information 4: aebe917bfb5c1b58e884baa1f9c3d3d2

A similar crash occurs when I install the packages with conda install -c conda-forge dask pyarrow instead:

Problem signature:
  Problem Event Name:   APPCRASH
  Application Name: python.exe
  Application Version:  3.8.3150.1013
  Application Timestamp:    5ed53446
  Fault Module Name:    arrow.dll
  Fault Module Version: 0.0.0.0
  Fault Module Timestamp:   5ecf56ac
  Exception Code:   c000001d
  Exception Offset: 0000000000521587
  OS Version:   6.3.9600.2.0.0.16.7
  Locale ID:    1033
  Additional Information 1: e863
  Additional Information 2: e8638a01b9fb70505b0604ef9b98f3c6
  Additional Information 3: 1e47
  Additional Information 4: 1e47c852f479606e071f3ea8f80878a1
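
Both crash reports show the same exception code, c000001d, in arrow.dll. A minimal sketch for decoding that value, assuming a small hand-written lookup of well-known Windows NTSTATUS codes (NTSTATUS_NAMES and describe_exception_code are hypothetical helpers written for illustration, not part of any library):

```python
# Hand-written lookup of a few well-known Windows NTSTATUS values.
# This table is illustrative and far from exhaustive.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def describe_exception_code(code_hex: str) -> str:
    """Map the hex 'Exception Code' from a crash report to a symbolic name."""
    code = int(code_hex, 16)  # accepts "c000001d" as well as "0xC000001D"
    return NTSTATUS_NAMES.get(code, f"unknown (0x{code:08X})")


print(describe_exception_code("c000001d"))  # prints STATUS_ILLEGAL_INSTRUCTION
```

An illegal-instruction fault can mean that the library executed a CPU instruction the processor does not support, although the crash report alone does not confirm that this is the cause here.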

Solution

  • Updating the packages fixed this as of July 1, 2020. I think a newer pyarrow release is what resolved it.
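
To check which versions are installed before and after updating, something like the following may help. It is a sketch that uses only the standard library (importlib.metadata, available from Python 3.8); the helper name installed_versions and the default package list are assumptions chosen for this example:

```python
from importlib import metadata


def installed_versions(packages=("pyarrow", "dask", "pandas", "numpy")):
    """Return a dict mapping each package name to its installed version, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Comparing the output from a crashing environment with the output from a working one shows which packages actually changed in the update.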