Tags: amazon-web-services, aws-glue, glob

Why does my Glue Crawler exclude pattern not apply?


I know that this has been asked before. But I have spent hours trying to get this to work.

I have a directory structure like:

- datalake
--- datasets
----- foo
------- 00001.json
------- 00002.json
------- latest.json
----- bar
------- 00001.json
------- latest.json

My include path looks like:

s3://<bucket_name>/datalake/datasets/

I want to exclude everything that is not a latest.json file.

I have tried every exclude pattern I could think of, including:

**0*
**/0**
*/0*
*0*
**0**

and many others.

Without fail, my crawler catalogs every .json.
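
For reference, here is a rough local check of what those patterns should match. It assumes the glob semantics described in the Glue documentation (`*` stays within one path segment, `**` may cross `/`, `?` is a single character) and that exclude patterns are evaluated against the object key relative to the include path. The `glob_to_regex`, `is_excluded`, and `survivors` helpers are hypothetical names for this sketch, and only that small subset of the glob syntax is handled:

```python
import re


def glob_to_regex(pattern):
    """Translate a small glob subset (*, **, ?) into a compiled regex.

    Assumption: '*' and '?' never match '/', while '**' may cross folders.
    Brackets and braces are NOT handled in this sketch.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # '**' may cross folder boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # '*' stays inside one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")    # '?' is exactly one non-slash character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key, patterns):
    """True if any pattern matches the whole key (relative to the include path)."""
    return any(glob_to_regex(p).fullmatch(relative_key) for p in patterns)


# Keys relative to the include path s3://<bucket_name>/datalake/datasets/
KEYS = [
    "foo/00001.json",
    "foo/00002.json",
    "foo/latest.json",
    "bar/00001.json",
    "bar/latest.json",
]

PATTERNS = ["**0*", "**/0**", "*/0*", "*0*", "**0**"]


def survivors(pattern):
    """Keys that would still be crawled if only this pattern were set."""
    return [k for k in KEYS if not is_excluded(k, [pattern])]


if __name__ == "__main__":
    for p in PATTERNS:
        print(p, "->", survivors(p))
```

Under these assumptions, four of the five patterns leave only the two latest.json keys, and only `*0*` excludes nothing, because a single `*` cannot cross the `/` between the folder and the file name. This is an approximation of the matching rules, not the crawler's actual implementation.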

I am checking the results of my crawl with Athena.

Am I seriously getting the exclude pattern wrong? Or am I somehow thinking about this entire thing the wrong way and my pattern is irrelevant?


Solution

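  • For me, the answer turned out to be that I was using Athena to look at the updated catalog. According to the Athena troubleshooting guide:

    https://docs.aws.amazon.com/athena/latest/ug/troubleshooting-athena.html#troubleshooting-athena-data-file-issues

    Athena does not recognize exclude patterns specified for an AWS Glue crawler. It reads every file under the table's S3 location, so the numbered .json files still show up in query results even when the crawler skipped them.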
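
    The same troubleshooting entry suggests keeping the files you want excluded in a different location. One possible way to act on that is to copy only the latest.json objects to their own prefix and point the crawler (and therefore the Athena table) there. The sketch below only computes the source-to-destination key mapping. The `plan_latest_copies` function and the `datalake/latest/` prefix are hypothetical, and the actual copy (for example with boto3's S3 `copy_object`) is left out:

```python
def plan_latest_copies(keys, src_prefix, dst_prefix, keep_name="latest.json"):
    """Map each '<src_prefix>.../latest.json' key to the same path under dst_prefix.

    Assumption: dst_prefix is a separate S3 prefix that holds nothing but the
    files Athena should see. Keys outside src_prefix are ignored.
    """
    plan = {}
    for key in keys:
        if not key.startswith(src_prefix):
            continue
        relative = key[len(src_prefix):]
        # Keep the object only if its file name is exactly keep_name.
        if relative.rsplit("/", 1)[-1] == keep_name:
            plan[key] = dst_prefix + relative
    return plan


if __name__ == "__main__":
    listed = [
        "datalake/datasets/foo/00001.json",
        "datalake/datasets/foo/latest.json",
        "datalake/datasets/bar/latest.json",
    ]
    for src, dst in plan_latest_copies(
        listed, "datalake/datasets/", "datalake/latest/"
    ).items():
        print(src, "->", dst)
```

    Keeping the per-dataset folder (foo, bar) in the destination key means each dataset still ends up with its own location, containing only its latest.json.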