I have a Glue crawler that is set up to crawl a bucket named internal-data-prod (s3://internal-data-prod). Last night someone dropped a CSV file not only at the root level (s3://internal-data-prod/data.csv) but also one folder level down (s3://internal-data-prod/folder/data.csv).
The crawler ran for the first time when the file was dropped at the top level, but the columns it inferred were wrong.
They "deleted" the file (versioning is enabled; more on that in a minute) and then reloaded it under the folder. The columns were still wrong: versioning was still enabled and the version history had not been deleted, so the crawler scanned the root-level file first. Then they pulled a Parquet file of the same data from another account. That data was crawled and almost everything is fine, except the crawler appended _parquet to the table name.
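For reference, a quick way to see everything a versioned bucket is still holding on to, old object versions and delete markers included, is something like this boto3 sketch (assuming default credentials and permissions; the printed format is just for illustration):

```python
import boto3

s3 = boto3.client("s3")

# Walk the full version history of the bucket: old object versions
# plus the delete markers that hide them from a normal listing.
paginator = s3.get_paginator("list_object_versions")
for page in paginator.paginate(Bucket="internal-data-prod"):
    for v in page.get("Versions", []):
        print("version:      ", v["Key"], v["VersionId"], v["IsLatest"])
    for m in page.get("DeleteMarkers", []):
        print("delete marker:", m["Key"], m["VersionId"], m["IsLatest"])
```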
The problem I have now is that I have removed the files and their version history under both the bucket root and the folder, but the _csv table is still being re-created whenever I crawl the bucket. I removed the tables from the database, and everything else, before re-crawling. When it recrawls, the table's data source shows s3://internal-data-prod/data.csv, which doesn't exist as far as I can see (with the console's show-versions toggle switched on).
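For anyone trying to reproduce a truly clean slate, the cleanup I'm describing looks roughly like this boto3 sketch. The database name internal_data_prod is a placeholder for whatever database your crawler writes to, and the first loop permanently destroys all version history, so treat it as illustrative only:

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

BUCKET = "internal-data-prod"
DATABASE = "internal_data_prod"  # placeholder: the crawler's target database

# 1. Permanently delete every object version AND every delete marker,
#    so no version history is left for anything to rediscover.
paginator = s3.get_paginator("list_object_versions")
for page in paginator.paginate(Bucket=BUCKET):
    for entry in page.get("Versions", []) + page.get("DeleteMarkers", []):
        s3.delete_object(Bucket=BUCKET, Key=entry["Key"],
                         VersionId=entry["VersionId"])

# 2. Drop every table in the target database so the next crawl starts clean.
for page in glue.get_paginator("get_tables").paginate(DatabaseName=DATABASE):
    for table in page["TableList"]:
        glue.delete_table(DatabaseName=DATABASE, Name=table["Name"])
```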
Why is it behaving this way and how do I fix it?
I wish I had a better answer. Here is what I ended up doing -