Tags: amazon-web-services, amazon-s3, aws-glue, aws-glue-spark

AWS Glue Exclude Patterns


I am working on a project that uses Glue 3.0 and PySpark to process large amounts of data between S3 buckets. The data is read from an S3 bucket into a DynamicFrame with GlueContext.create_dynamic_frame_from_options, with the recurse connection option set to True because the data is heavily nested. I only want to read files that end in meta.json, so I set the exclusions option to exclude any files that end in data.csv: "exclusions": ['**.{txt, csv}', '**/*.data.csv', '**.data.csv', '*.data.csv']. However, I consistently get the following error:

An error occurred while calling o90.pyWriteDynamicFrame. Unable to parse file: <filename>.data.csv

Is it possible to log the full S3 URI to the output logs, or to keep track of which files have and have not been processed? And why is Glue still trying to parse this file even though it matches the exclusions?


Solution

  • The exclusions option has to be a string containing a JSON list of glob patterns, not a Python list:

    "exclusions": "[\"**/*.txt\", \"**/*.csv\"]",
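Rather than hand-escaping the quotes, the string can probably be built with json.dumps. The following is a minimal sketch, not a tested Glue job: the bucket path is hypothetical, and format="json" in the commented-out call is an assumption based on the meta.json files mentioned in the question.

```python
import json

# Glob patterns to skip; these mirror the patterns in the answer above.
exclude_patterns = ["**/*.txt", "**/*.csv"]

connection_options = {
    # Hypothetical bucket path, used only for illustration.
    "paths": ["s3://example-bucket/input/"],
    "recurse": True,
    # json.dumps turns the Python list into a JSON-list string,
    # which is the form the answer above says Glue expects.
    "exclusions": json.dumps(exclude_patterns),
}

print(connection_options["exclusions"])  # ["**/*.txt", "**/*.csv"]

# Inside a Glue job, the options would then be passed along these lines
# (glue_context is assumed to be an existing GlueContext; format="json"
# is an assumption, not something stated in the question):
# dyf = glue_context.create_dynamic_frame_from_options(
#     connection_type="s3",
#     connection_options=connection_options,
#     format="json",
# )
```

Building the string this way keeps the patterns in an ordinary Python list, so they can be edited without touching any escaped quotes.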