Tags: amazon-web-services, aws-glue

AWS Glue Crawler creating table for non-existent file


I have a Glue crawler that is set up to crawl a bucket named internal-data-prod (s3://internal-data-prod). Last night someone dropped a CSV file not only at the root level (s3://internal-data-prod/data.csv) but also one folder level down (s3://internal-data-prod/folder/data.csv).

The crawler ran the first time when the file was dropped at the top level, but the columns were wrong.

They "deleted" the file (versioning is enabled; more on this in a minute) and then reloaded it under the folder. The columns were still wrong because versioning was enabled and the version history had not been deleted, so the crawler scanned the root-level file first. They then pulled a Parquet file with the same data from another account. That data crawled almost cleanly, except the crawler appended _parquet to the table name.

The problem I have now is that I have removed the files and the version history from the bucket and the folder, but the _csv table is still being re-created when I crawl the bucket. I removed the tables from the database before re-crawling. On the re-crawl, the data source shows s3://internal-data-prod/data.csv, which does not exist as far as I can see (even with the "Show versions" toggle enabled).
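One way to confirm the bucket really is empty of stale versions is to purge every object version and delete marker programmatically. This is a sketch, assuming the boto3 SDK and credentials that can delete objects in the bucket; the `batch_delete_payload` helper is my own illustration, not an AWS API, and the bucket name comes from the question.

```python
def batch_delete_payload(versions, batch_size=1000):
    """Group (key, version_id) pairs into S3 DeleteObjects payloads.

    S3's DeleteObjects API accepts at most 1000 keys per request.
    """
    items = [{"Key": k, "VersionId": v} for k, v in versions]
    return [
        {"Objects": items[i : i + batch_size]}
        for i in range(0, len(items), batch_size)
    ]


def purge_all_versions(bucket_name):
    """Permanently delete every object version and delete marker."""
    import boto3  # assumed available: pip install boto3

    s3 = boto3.client("s3")
    found = []
    # list_object_versions returns both live versions and delete markers.
    for page in s3.get_paginator("list_object_versions").paginate(Bucket=bucket_name):
        for v in page.get("Versions", []) + page.get("DeleteMarkers", []):
            found.append((v["Key"], v["VersionId"]))
    for payload in batch_delete_payload(found):
        s3.delete_objects(Bucket=bucket_name, Delete=payload)


# purge_all_versions("internal-data-prod")
```

After this, a `list_object_versions` call on the bucket should return nothing, so the crawler has no stale version to classify.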

Why is it behaving this way and how do I fix it?


Solution

  • I wish I had a better answer. Here is what I ended up doing:

    1. Cleared out the entire bucket (there was minimal data here anyway) and waited an hour
    2. Reloaded the data.parquet file to the bucket root
    3. The table name was wrong: data_parquet
    4. Deleted the file and the table
    5. Re-crawled
    6. Oops: putting the file in the bucket root doesn't work with Redshift Spectrum
    7. Moved the file to an S3 folder named data
    8. Re-crawled
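    The _parquet suffix in step 3 follows from how the crawler derives table names: for a lone file at the bucket root, the name comes from the file name, with characters that are invalid in a table name (like the dot) replaced by underscores; once the file sits in a folder, the folder name is used instead. A rough sketch of that naming rule (my own approximation for illustration, not Glue's actual code):

    ```python
    import re

    def glue_style_table_name(s3_path):
        """Approximate the table name a Glue crawler derives from an S3 path.

        A root-level object is named after its file name; otherwise the
        enclosing folder name is used. Invalid characters become underscores.
        """
        parts = s3_path.removeprefix("s3://").split("/")
        # parts[0] is the bucket; a root-level object has exactly two parts.
        source = parts[1] if len(parts) == 2 else parts[-2]
        return re.sub(r"[^0-9a-z_]", "_", source.lower())


    print(glue_style_table_name("s3://internal-data-prod/data.parquet"))       # data_parquet
    print(glue_style_table_name("s3://internal-data-prod/data/data.parquet"))  # data
    ```

    That is why moving the file into a folder named data (step 7) finally produced a table simply named data.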