Tags: directory, schema, parquet, aws-glue

AWS Glue job failure working with partitioned Parquet files in nested S3 folders


I get the following error when running a Glue job over partitioned Parquet files: "Unable to infer a schema for Parquet. It must be specified manually."

I have set up my crawler and successfully obtained the schema for my Parquet files. I can view the data in Athena, and I have created the schema manually on my target Redshift cluster.

I can load the files via Glue into Redshift if all my data is in one folder only. But when I point at a folder that contains nested folders, e.g. folder X containing 04 and 05, the Glue job fails with the message "Unable to infer a schema for Parquet. It must be specified manually."

This is strange, as it works if I put all of these files into the same folder.
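For illustration, the layout and the kind of S3-source read involved look roughly like the sketch below. This is a hypothetical reconstruction in Scala; the bucket name, prefix, and file names are made up, and the asker's actual job script is not shown in the question.

```scala
// Assumed S3 layout: the Parquet files live one level down, in the
// nested month folders, not directly under folderX/.
//   s3://my-bucket/folderX/04/part-00000.snappy.parquet
//   s3://my-bucket/folderX/05/part-00000.snappy.parquet

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext

val glueContext = new GlueContext(new SparkContext())

// Reading the parent prefix directly: if Glue only sees the two sub-folders
// here and no Parquet files at the top level, schema inference can fail with
// "Unable to infer a schema for Parquet. It must be specified manually".
val dyf = glueContext.getSourceWithFormat(
  connectionType = "s3",
  options = JsonOptions("""{"paths": ["s3://my-bucket/folderX/"]}"""),
  format = "parquet"
).getDynamicFrame()
```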


Solution

  • I found a solution here that works for me: Firehose JSON -> S3 Parquet -> ETL Spark, error: Unable to infer schema for Parquet

    It is the Scala version of the ETL Glue job (see the sketch below).
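
For reference, here is a minimal sketch of what such a Scala Glue job might look like, assuming the fix is to read the nested Parquet folders directly with the Spark DataFrame reader (so directory discovery is handled by Spark) and then hand the result to Glue's Redshift sink. The bucket, prefix, connection name, database, and table names below are placeholders, not taken from the linked answer.

```scala
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object NestedParquetToRedshift {
  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(new SparkContext())
    val spark = glueContext.getSparkSession

    val args = GlueArgParser.getResolvedOptions(sysArgs, Array("JOB_NAME"))
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Let Spark discover the Parquet files in the nested month folders via a
    // glob on the parent prefix (explicit paths such as ".../04/" and
    // ".../05/" would also work).
    val df = spark.read.parquet("s3://my-bucket/folderX/*/")

    // Wrap the DataFrame so it can be written with Glue's Redshift sink.
    // The connection name, temp dir, database, and table are placeholders.
    val dyf = DynamicFrame(df, glueContext)
    glueContext.getJDBCSink(
      catalogConnection = "my-redshift-connection",
      options = JsonOptions("""{"dbtable": "public.my_table", "database": "my_db"}"""),
      redshiftTmpDir = "s3://my-bucket/glue-temp/"
    ).writeDynamicFrame(dyf)

    Job.commit()
  }
}
```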