I am reading data via

    glueContext.create_data_frame.from_catalog(database="db", table_name="ta")

from Parquet files on an S3 bucket.
Unfortunately, it seems the bucket contains a non-Parquet file (last_ingest_partition), which causes the following error:
An error occurred while calling o92.getDataFrame. s3://cdh/measurements/ta/last_ingest_partition is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 45, 49, 50]
Is there a way to exclude this file from being read? I have tried something like

    glueContext.create_data_frame.from_catalog(database="db", table_name="ta", additional_options={"exclusions": "[\"**last_ingest_partition\"]"})

but this does not work for me.
Here is what I have found out and what solved my problem:
I used create_dynamic_frame.from_catalog instead of create_data_frame.from_catalog and added a .toDF() afterwards; then everything worked fine for me.
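For reference, a minimal sketch of that swap, reusing the database and table names from the question; the Spark/Glue setup lines are the usual job boilerplate and an assumption on my part:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    sc = SparkContext.getOrCreate()
    glueContext = GlueContext(sc)

    # Read through a DynamicFrame, then convert to a Spark DataFrame
    dyf = glueContext.create_dynamic_frame.from_catalog(database="db", table_name="ta")
    df = dyf.toDF()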
With create_dynamic_frame I could also use exclusions as additional options:

    glueContext.create_dynamic_frame.from_catalog(database="testdb1", table_name="cxexclude", additional_options={"exclusions": "[\"**{json,parquet}**\"]"})
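Applied to the question's bucket, a pattern along these lines should skip the offending file; the **last_ingest_partition glob is my guess at what matches that key:

    # Exclude the non-Parquet marker file before reading
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="db",
        table_name="ta",
        additional_options={"exclusions": "[\"**last_ingest_partition\"]"}
    )
    df = dyf.toDF()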
Note that with the create_data_frame class there are limitations: Spark DataFrame partition filtering doesn't work.
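If partition pruning is still needed, one workaround (my suggestion, not part of the original answer) is to push the filter down on the DynamicFrame read via push_down_predicate; the partition column ingest_date below is hypothetical:

    # "ingest_date" is a hypothetical partition column; replace it with your own
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="db",
        table_name="ta",
        push_down_predicate="ingest_date >= '2024-01-01'",
        additional_options={"exclusions": "[\"**last_ingest_partition\"]"}
    )
    df = dyf.toDF()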