Tags: t-sql, parquet, azure-synapse

Azure Synapse Serverless -- 'Unexpected end of stream' when loading a parquet file in a serverless SQL pool


Trying to load the following parquet file results in an 'Unexpected end of stream' error. Perhaps the parquet file is corrupted?

intended query

SELECT * FROM OPENROWSET(
        BULK N'/path/to/my.parquet',
        DATA_SOURCE='my_blob_datasource',
        FORMAT='PARQUET'
) as somehow_necessary_alias

error statement

Statement ID: {1F956C7F-FA37-4D34-AC0D-D4BDF8336160}
Query hash: 0x729ED47717032037
Distributed request ID: {5815A501-2179-4968-BE71-3B7AACFE717C}.
Total size of data scanned is 27 megabytes,
total size of data moved is 290 megabytes,
total size of data written is 0 megabytes.

Msg 15813, Level 16, State 1, Line 1 Error handling external file: 
'Unexpected end of stream'. File/External table name:
https://mystorageaccount.blob.core.windows.net/mycontainer/path/to/my.parquet

Solution

  • So we found the answer, though I couldn't put together a reproducible example that I can share; I'll give as much explanation as I can. The parquet file was created by pandas.to_parquet() using the pyarrow engine. There's currently an Arrow bug where an as-of-yet undiagnosed pandas index quirk translates to an extra column, named "__index_level_0__", being appended at the end of the file. I imagine this is what was surfacing the Unexpected end of stream error. (A quick way to check a file's schema for this column is sketched at the end of this answer.)

    This was first flagged as a pandas issue, but was then moved to an Arrow Jira ticket, which includes a reproducible example.

    However, our scenario was different in that we weren't using pandas.to_parquet()'s partition_cols param, so something else kooky was going on. Fortunately, the pandas docs touch on this by suggesting the following:

    the code below creates a parquet file with three columns if you use pyarrow for serialization: a, b, and __index_level_0__. If you’re using fastparquet, the index may or may not be written to the file.

    df.to_parquet("test.parquet", engine="pyarrow")

    This unexpected extra column causes some databases like Amazon Redshift to reject the file, because that column doesn’t exist in the target table.

    If you want to omit a dataframe’s indexes when writing, pass index=False to to_parquet().

    P.S. If you're using pandas to create parquet files for upload into Synapse, be sure to also pass use_deprecated_int96_timestamps=True to to_parquet() so Synapse will properly parse your datetime columns; a minimal example combining both options is sketched below. See this question for related info.
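
    For what it's worth, here's a minimal sketch of how a file can be written with both options (the column names and values are made up for illustration, not taken from our actual data):

    import pandas as pd

    df = pd.DataFrame({
        "a": [1, 2, 3],
        "b": ["x", "y", "z"],
        "ts": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    })

    # index=False stops pandas/pyarrow from writing the __index_level_0__ column;
    # use_deprecated_int96_timestamps=True is forwarded to pyarrow's writer so
    # timestamps are stored as INT96, which Synapse parses correctly.
    df.to_parquet(
        "test.parquet",
        engine="pyarrow",
        index=False,
        use_deprecated_int96_timestamps=True,
    )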
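
    And a quick way to check whether an existing file carries the stray index column, using pyarrow directly (assuming pyarrow is installed; "test.parquet" is just the placeholder name from the docs example above):

    import pyarrow.parquet as pq

    # Read only the schema (no data) and look for the pandas index column.
    schema = pq.read_schema("test.parquet")
    print(schema.names)
    assert "__index_level_0__" not in schema.names, "file still has the pandas index column"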