
AWS Glue Crawler issue with S3 export from RDS


I experienced an issue in production with the AWS Glue Crawler that scans an S3 bucket containing a data export from Amazon RDS.
For context, here is how my pipeline is built:

  1. Create a Snapshot from RDS database
  2. Export the Snapshot to S3
  3. Crawl the S3 bucket to load the tables in Glue Database.
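The first two steps can be sketched in Terraform as well (a minimal sketch only; in my setup these steps are driven elsewhere, and the identifiers, bucket, role, and KMS key below are placeholders):

```hcl
# Sketch only: all names and ARNs are placeholders.

# Step 1: take a snapshot of the RDS instance.
resource "aws_db_snapshot" "export_source" {
  db_instance_identifier = "my-rds-instance"
  db_snapshot_identifier = "my-rds-snapshot"
}

# Step 2: export the snapshot to S3 (the export API requires a KMS key).
resource "aws_rds_export_task" "snapshot_export" {
  export_task_identifier = "my-snapshot-export"
  source_arn             = aws_db_snapshot.export_source.db_snapshot_arn
  s3_bucket_name         = "bucket-name"
  s3_prefix              = "folder/subfolder"
  iam_role_arn           = aws_iam_role.iam_role.arn
  kms_key_id             = "arn:aws:kms:eu-west-1:123456789012:key/placeholder"
}
```

Step 3 is the crawler resource shown further down.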

I started getting the error on February 1st, 2025. Here is the message displayed in the Crawler logs (with mocked names):

[id] INFO : Some files do not match the schema detected. 
Remove or exclude the following files from the crawler (truncated to first 200 files):
bucket-name/folder/subfolder/table.name/1/_SUCCESS

The first problem is that the crawler does not treat this as an error and only logs it at the INFO level. Nothing changed in my application code or infrastructure code (the terraform plan did not intend to change anything, so the crawler itself was untouched).

The real issue was that the crawler was mistaking the .parquet partition files and _SUCCESS flags in the S3 bucket for tables.
For example, if my parquet file was here: s3://bucket-name/folder/subfolder/table.name/1/part-000-1234.parquet

Instead of creating the table "table.name" in the Glue database, it creates a table called "part-000-1234.parquet", or sometimes one named after the success flag followed by an id, like "_success_840193" (generated by the S3 export from RDS).

Here is the list of checks I ran:

  • I checked the versions of the libraries installed in the pipeline; nothing changed between January and February.
  • Neither the infrastructure nor the application code changed.
  • There were no schema changes on RDS.
  • I crawled the January data to see whether the issue came from the data itself, and encountered the same error.

Here is what my crawler looked like:

resource "aws_glue_crawler" "database-crawler" {
  name = "crawler-name"

  database_name = "database"
  role          = aws_iam_role.iam_role.arn

  s3_target {
    path = "s3://bucket/"
  }
}

Solution

  • After talking to AWS Support for half a day, I found the solution. The issue comes from an undocumented change on their side: it appears neither in the AWS documentation nor on any tech blog or status page. I don't know what they changed, but apparently other customers experienced the same error, and since I couldn't find any post about it, here is what fixed the problem for us:

    Exclude the _SUCCESS files from the crawled data. You can do that by adding the exclusions argument to your terraform code:

      s3_target {
        path       = "s3://bucket/"
        exclusions = ["**/_SUCCESS"]
      }
    

    You can also do this in the AWS console by adding `**/_SUCCESS` directly to the crawler's exclude patterns.
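
    Putting it together, the full crawler resource with the fix looks like this (same mocked names as above):

      resource "aws_glue_crawler" "database-crawler" {
        name = "crawler-name"

        database_name = "database"
        role          = aws_iam_role.iam_role.arn

        s3_target {
          path       = "s3://bucket/"
          # Skip the success markers written by the RDS export next to the
          # parquet files; without this, the crawler turns them into tables.
          exclusions = ["**/_SUCCESS"]
        }
      }

    If your export also drops other non-data files into the bucket (the RDS export writes some JSON metadata files, for example), the same exclusions list can be extended with additional glob patterns.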