Trying to load the following parquet file results in an "Unexpected end of stream" error. Perhaps the parquet file is corrupted?
SELECT * FROM OPENROWSET(
    BULK N'/path/to/my.parquet',
    DATA_SOURCE = 'my_blob_datasource',
    FORMAT = 'PARQUET'
) AS somehow_necessary_alias
Statement ID: {1F956C7F-FA37-4D34-AC0D-D4BDF8336160}
Query hash: 0x729ED47717032037
Distributed request ID: {5815A501-2179-4968-BE71-3B7AACFE717C}.
Total size of data scanned is 27 megabytes,
total size of data moved is 290 megabytes,
total size of data written is 0 megabytes.
Msg 15813, Level 16, State 1, Line 1 Error handling external file:
'Unexpected end of stream'. File/External table name:
https://mystorageaccount.blob.core.windows.net/mycontainer/path/to/my.parquet
We found the answer, but I couldn't put together a reproducible example I could share, so I'll give as much explanation as I can. The parquet file was created by pandas.to_parquet() using the pyarrow engine. There's currently an Arrow bug where an as-yet-undiagnosed pandas index quirk translates to an extra column named "__index_level_0__" being appended to the end of the file. I imagine this is what was surfacing the "Unexpected end of stream" error.
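To see whether a file is affected, you can inspect the schema pyarrow actually wrote. Here's a minimal sketch with a made-up dataframe and file name; filtering is just one way to end up with a non-default index that may get serialized as that extra column:

import pandas as pd
import pyarrow.parquet as pq

# Hypothetical stand-in for the real data; the boolean filter leaves the
# dataframe with a non-default index, which is one way the quirk can surface.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df = df[df["a"] > 1]

df.to_parquet("my.parquet", engine="pyarrow")

# Print the schema that actually landed in the file; if the index was
# serialized, a trailing "__index_level_0__" column shows up here.
print(pq.read_schema("my.parquet"))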
This issue was first flagged as a pandas issue, but then moved to an Arrow Jira ticket, which has a reproducible example.
However, our scenario was different in that we weren't using pandas.to_parquet()'s partition_cols param, so something else kooky was going on. Fortunately, the pandas docs touch on this by suggesting the following:
the code below creates a parquet file with three columns if you use pyarrow for serialization: a, b, and __index_level_0__. If you're using fastparquet, the index may or may not be written to the file.

df.to_parquet("test.parquet", engine="pyarrow")
This unexpected extra column causes some databases like Amazon Redshift to reject the file, because that column doesn’t exist in the target table.
If you want to omit a dataframe's indexes when writing, pass index=False to to_parquet().
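In other words, something like this (again with hypothetical data and file names) keeps the index out of the file entirely:

import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})  # hypothetical data

# index=False tells pandas not to serialize the index at all, so no
# __index_level_0__ column is written and the file's columns match the
# dataframe's declared columns exactly.
df.to_parquet("my.parquet", engine="pyarrow", index=False)

print(pq.read_schema("my.parquet"))  # only "a" and "b" remain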
P.S. If you're using pandas to create parquet files for upload into Synapse, be sure to pass use_deprecated_int96_timestamps=True to to_parquet() so Synapse will properly parse your datetime columns. See this question for related info.
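Since to_parquet() forwards extra keyword arguments to pyarrow's writer, that looks roughly like this sketch (hypothetical data and file name):

import pandas as pd

# Hypothetical dataframe with a datetime column
df = pd.DataFrame({"created_at": pd.to_datetime(["2021-01-01", "2021-06-15"])})

# The extra keyword argument is passed through to pyarrow's parquet writer,
# which then stores timestamps as INT96 values that Synapse can parse.
df.to_parquet(
    "my.parquet",
    engine="pyarrow",
    index=False,
    use_deprecated_int96_timestamps=True,
)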