pyspark aws-glue

Exclude files based on name when calling from_catalog


I am reading data via

glueContext.create_data_frame.from_catalog(database = "db", table_name = "ta")

from Parquet files in an S3 bucket. Unfortunately, the bucket also contains a non-Parquet file (last_ingest_partition), which causes the following error:

An error occurred while calling o92.getDataFrame. s3://cdh/measurements/ta/last_ingest_partition is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 45, 49, 50]

Is there a way to exclude this file from being read? I have tried something like

glueContext.create_data_frame.from_catalog(database = "db", table_name = "ta", additional_options={"exclusions" : "[\"**last_ingest_partition\""})

but this does not work for me.


Solution

  • Here is what I found out and what solved my problem:

    1. When I switched my code from create_data_frame.from_catalog to create_dynamic_frame.from_catalog and added a .toDF() afterwards, everything worked fine.
    2. With create_dynamic_frame I could also pass exclusions as additional options: glueContext.create_dynamic_frame.from_catalog(database = "testdb1", table_name = "cxexclude", additional_options={"exclusions": "[\"**{json,parquet}**\"]"})
    3. create_data_frame has limitations: Spark DataFrame partition filtering doesn't work.
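Points 1 and 2 can be combined roughly as in the sketch below. The helper names build_exclusions and read_table_excluding are made up for illustration, and glue_context is assumed to be an existing GlueContext. The glob pattern is the one from the question and has not been verified against a real bucket. Building the option value with json.dumps avoids hand-escaping mistakes such as an unbalanced bracket.

```python
import json


def build_exclusions(patterns):
    """Serialize glob patterns into the JSON-list string that the
    "exclusions" option expects (hypothetical helper)."""
    return json.dumps(list(patterns))


def read_table_excluding(glue_context, database, table_name, patterns):
    """Read a catalog table as a DynamicFrame, skipping files that match
    the given glob patterns, then convert it to a Spark DataFrame.

    glue_context is assumed to be an awsglue.context.GlueContext
    created elsewhere in the job.
    """
    dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        additional_options={"exclusions": build_exclusions(patterns)},
    )
    # toDF() gives back a regular Spark DataFrame, as in point 1.
    return dynamic_frame.toDF()
```

Inside a Glue job this would be called as read_table_excluding(glueContext, "db", "ta", ["**last_ingest_partition"]); the pattern may need adjusting to the actual S3 layout.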