snowflake-cloud-data-platform, parquet

Parquet files generated by Snowflake are not readable by other tools


I'm trying to dump a Snowflake table to Parquet.

Rereading the Parquet file from within Snowflake works, but when I read it with other tools (pandas, pyarrow, ...) I get an error about the format.

Code to reproduce:

import pandas
from snowflake.ml.fileset import sfcfs
from snowflake.snowpark import Session

connection_parameters = {}  # setup specific: fill in for your account

snowpark_session = Session.builder.configs(connection_parameters).create()

df = snowpark_session.createDataFrame(pandas.DataFrame({'a': [1,2,3]}))
full_name = f'{snowpark_session.get_session_stage()}/report1'
df.write.parquet(full_name, header=True, overwrite=True)

# This works: reading the file back through Snowpark
snowpark_session.read.parquet(full_name)

# This fails: reading the same file through the Snowflake ML file system
fs = sfcfs.SFFileSystem(snowpark_session=snowpark_session)
file_name = fs.ls(snowpark_session.get_session_stage())[0]
pandas.read_parquet(fs.open(file_name))

The error message I get is:

ArrowInvalid: Could not open Parquet input source '': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
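For context on the message: an unencrypted Parquet file starts and ends with the 4-byte marker PAR1, and the reader is reporting that it did not find that marker at the end of the bytes it was given. A minimal sketch for checking the raw bytes yourself follows; has_parquet_magic is a hypothetical helper name used for illustration, not part of pandas, pyarrow, or Snowflake.

```python
# Marker that an unencrypted Parquet file carries at both its start and its end.
PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(data: bytes) -> bool:
    """Return True if data starts and ends with the Parquet magic bytes.

    Hypothetical helper for debugging; it only inspects the first and last
    four bytes and does not validate the rest of the file.
    """
    return (
        len(data) >= 2 * len(PARQUET_MAGIC)
        and data.startswith(PARQUET_MAGIC)
        and data.endswith(PARQUET_MAGIC)
    )
```

Assuming the object returned by fs.open(file_name) supports read(), calling has_parquet_magic(fs.open(file_name).read()) should show whether the bytes handed to pandas are a plain Parquet file at all.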


Solution

  • The issue appears to be with the Snowflake file system object (sfcfs.SFFileSystem). An alternative API, snowpark_session.file.get_stream, does work:

    pandas.read_parquet(snowpark_session.file.get_stream(file_name))
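If the stream returned by get_stream turns out not to be seekable in your setup (this is an assumption; the answer above does not say either way), one possible workaround is to copy it into memory before handing it to pandas. The sketch below uses only the standard library, and to_seekable is a hypothetical helper name.

```python
import io
from typing import BinaryIO


def to_seekable(stream: BinaryIO) -> io.BytesIO:
    """Read a binary stream fully into memory and return a seekable buffer.

    Hypothetical helper: it loads the whole file into RAM, so it only suits
    files that fit comfortably in memory.
    """
    return io.BytesIO(stream.read())


# Usage with the names from the question (needs a live Snowpark session):
# stream = snowpark_session.file.get_stream(file_name)
# df = pandas.read_parquet(to_seekable(stream))
```

The trade-off is memory: the whole file is held in RAM, which is fine for a small report but not for a large table export.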