Tags: amazon-web-services, aws-glue, aws-glue-data-catalog

AWS Glue Crawlers - How to handle large directory structure of CSVs that may only contain strings


Been at this for a few days and any help is greatly appreciated.

Background: I am attempting to create one or more Glue crawlers to crawl the following S3 "directory" structure:

.
+-- _source1
|   +-- _item1
|   |   +-- _2019  #year
|   |   |   +-- _08  #month
|   |   |   |   +-- _30  #day
|   |   |   |   |   +-- FILE1.csv  #files
|   |   |   |   |   +-- FILE2.csv
|   |   |   |   +-- _31
|   |   |   |   |   +-- FILE1.csv
|   |   |   |   |   +-- FILE2.csv
|   |   |   +-- _09
|   |   |   |   +-- _01
|   |   |   |   +-- _02
|   +-- _item2
|   |   +-- _2019
|   |   |   +-- _08
|   |   |   |   +-- _30
|   |   |   |   +-- _31
|   |   |   +-- _09
|   |   |   |   +-- _01
|   |   |   |   +-- _02
+-- _source2
|   +-- ....
........  # and so on...

This goes on for several sources, each with potentially 30+ items, each of which has the year/month/day directory structure within.

All files are CSVs and should not change once they're in S3. However, the files within each item folder may have columns added to their schemas in the future:

  • 2019/12/01/FILE.csv may have additional columns compared to 2019/09/01/FILE.csv.
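
To be clear, the column additions themselves aren't what I'm stuck on; as far as I can tell, the crawler's schema-update settings handle that part. Roughly what I mean, as a boto3 sketch (the bucket, role, and database names below are placeholders, not my real setup):

import json
import boto3

glue = boto3.client("glue")

# Sketch: crawl one item-level prefix and fold newly appearing CSV columns
# into the existing catalog table instead of spawning new schemas.
glue.create_crawler(
    Name="source1-item1-crawler",  # placeholder
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    DatabaseName="my_catalog_db",  # placeholder
    Targets={"S3Targets": [{"Path": "s3://my-bucket/source1/item1/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # pick up added columns
        "DeleteBehavior": "LOG",  # don't delete columns for older files
    },
    Configuration=json.dumps({
        "Version": 1.0,
        "CrawlerOutput": {"Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}},
    }),
)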

What I've Done:

In my testing so far, crawlers created at the source-level directories (see above) have worked perfectly, as long as no CSV contains only string-type columns.
This is due to the following restriction stated in the AWS docs:

The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.

Normally, I'd imagine you could get around this by creating a custom classifier that expects a certain CSV schema, but since I may have 200+ items (each with a different schema) to crawl, I'd like to avoid creating one classifier per schema.
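
(Side note: if I'm reading the boto3 docs right, the Header list on a CSV classifier is optional, so a single generic classifier that only asserts "headers are present" might cover every item without naming any columns. A sketch, with a placeholder classifier name:)

import boto3

glue = boto3.client("glue")

# Sketch: one generic CSV classifier that declares the first row a header
# without pinning down a specific schema. "PRESENT" should correspond to
# the console's "Has headings" option.
glue.create_classifier(
    CsvClassifier={
        "Name": "csv-headers-present",  # placeholder
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",  # always treat row 1 as the header
        # No "Header" list given: column names come from each file itself.
    }
)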


Proposed Solutions:

  1. Ideally, I'd like to force my crawlers to interpret the first row of every CSV as a header, but this doesn't seem possible...
  2. Add a dummy INT column to every CSV to force my crawlers to read the CSV headers, and delete/ignore the column down the pipeline. (Seems very hackish)
  3. Find another file format that works (will require changes throughout my ETL pipeline)
  4. DON'T USE GLUE

Thanks again for any help!


Solution

  • Found the issue: it turns out that for an updated Glue crawler classifier to take effect, a new crawler must be created with the updated classifier applied. As far as I can tell, this is not explicitly mentioned in the AWS docs; I've only seen it mentioned over on GitHub.

    Early on in my testing, I modified an existing CSV classifier that specifies "Has Columns", but I never created a new crawler to apply the modified classifier to. Once I created a new crawler and applied the classifier, all Data Catalog tables were created as expected, regardless of column types.

    TL;DR: Modified classifiers will not take effect unless they are applied to a new crawler. Source
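
In code terms, the fix amounts to attaching the updated classifier to a brand-new crawler, rather than expecting a plain update_classifier call to propagate to the old one. A boto3 sketch, with placeholder names:

import boto3

glue = boto3.client("glue")

# Sketch: calling update_classifier on the existing classifier was not
# enough; it only took effect once attached to a brand-new crawler.
glue.create_crawler(
    Name="source1-crawler-v2",  # a new crawler, not the original
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    DatabaseName="my_catalog_db",  # placeholder
    Targets={"S3Targets": [{"Path": "s3://my-bucket/source1/"}]},
    Classifiers=["csv-headers-present"],  # the modified/custom classifier
)
glue.start_crawler(Name="source1-crawler-v2")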