I experienced an issue in production with the AWS Glue Crawler that scans an S3 bucket containing a data export from Amazon RDS.
For context, here is how my pipeline is built: Amazon RDS exports data as Parquet files to an S3 bucket, and a Glue Crawler scans that bucket to create the corresponding tables in a Glue Database.
I started seeing the error on February 1st, 2025. Here is the message displayed in the crawler logs (with mocked names):
[id] INFO : Some files do not match the schema detected.
Remove or exclude the following files from the crawler (truncated to first 200 files):
bucket-name/folder/subfolder/table.name/1/_SUCCESS
The first problem is that the crawler does not treat this as an error; it only logs it at the INFO level. Nothing had changed in my application code or infrastructure code (terraform plan showed no changes, so the crawler itself had not been modified).
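Because the message only shows up at INFO level, it will never trip an error-based alert on its own. If you want to be notified when it happens, one option is a CloudWatch Logs metric filter on the crawler log group. Here is a minimal sketch, assuming the default /aws-glue/crawlers log group; the filter name, metric name, and namespace are my own placeholders:

resource "aws_cloudwatch_log_metric_filter" "crawler_schema_mismatch" {
  name           = "crawler-schema-mismatch" # placeholder name
  log_group_name = "/aws-glue/crawlers"      # default log group for Glue crawlers
  # Match the INFO message the crawler emits when files don't fit the schema
  pattern        = "\"Some files do not match the schema detected\""

  metric_transformation {
    name      = "CrawlerSchemaMismatch"
    namespace = "Custom/Glue" # hypothetical namespace
    value     = "1"
  }
}

You can then hang a regular CloudWatch alarm off the CrawlerSchemaMismatch metric.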
The real issue was that the crawler was mistaking .parquet partitions and _SUCCESS flags in the S3 bucket for tables.
For example, if my parquet file was here:
s3://bucket-name/folder/subfolder/table.name/1/part-000-1234.parquet
Instead of creating the table "table.name" in the Glue Database, it created a table called "part-000-1234.parquet", or sometimes one named after the success flag followed by an id, like "_success_840193" (generated by the S3 export from RDS).
Here is what my crawler looked like:
resource "aws_glue_crawler" "database-crawler" {
name = "crawler-name"
database_name = "database"
role = aws_iam_role.iam_role
s3_target {
path = "s3://bucket/"
}
}
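For reference, the aws_iam_role.iam_role referenced above is a standard Glue service role. A minimal sketch could look like this; the role name is a placeholder, and I'm assuming the AWS-managed AWSGlueServiceRole policy plus whatever S3 read access your bucket needs:

resource "aws_iam_role" "iam_role" {
  name = "glue-crawler-role" # placeholder name
  # Allow the Glue service to assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "glue.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.iam_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}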
After talking to AWS Support for half a day, I found the solution. The issue comes from a change on their side that hasn't been documented anywhere: not in the AWS documentation, not in a tech blog, not on a status page. I don't know what they changed, but apparently other customers have experienced the same error, and since I couldn't find any post about it, here is what fixed the problem for us:
Exclude the _SUCCESS files from the crawled data. You can do that by adding the exclusions argument to the s3_target block in your Terraform code:
s3_target {
  path       = "s3://bucket/"
  exclusions = ["**/_SUCCESS"]
}
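Put together, the full crawler resource from earlier now looks like this (same placeholder names as above):

resource "aws_glue_crawler" "database-crawler" {
  name          = "crawler-name"
  database_name = "database"
  role          = aws_iam_role.iam_role.arn

  s3_target {
    path = "s3://bucket/"
    # Skip the _SUCCESS flag files at any depth so the crawler
    # only infers the schema from the .parquet data files
    exclusions = ["**/_SUCCESS"]
  }
}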
You can also do this in the AWS console by adding the **/_SUCCESS glob pattern directly to the crawler's list of exclude patterns.