
How to set the Glue Crawler RecrawlPolicy in my CF template


I would like to set my Glue crawler to crawl only new folders in my S3 bucket. Based on the documentation, it looks like I need to set RecrawlBehavior to CRAWL_NEW_FOLDERS_ONLY, but I can't find any guidance on how to do that in a CloudFormation template.

This is my crawler's configuration property now, but my use of RecrawlBehavior is invalid:

Configuration: "{\"Version\":1.0,\"RecrawlBehavior\":\"CRAWL_NEW_FOLDERS_ONLY\",\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"

Solution

  • As far as I understand, the incremental crawl policy is a relatively new feature in Glue and is not supported in CloudFormation yet.

    A workaround I can suggest is to create the crawler with CloudFormation and then use the AWS CLI to update its RecrawlPolicy property.

    When you create a crawler with CloudFormation and retrieve its properties with the CLI, its "RecrawlPolicy" has "RecrawlBehavior" set to "CRAWL_EVERYTHING". You can use the command below to change it to incremental crawls (crawl new folders only).

    aws glue update-crawler \
        --name <crawlername> \
        --recrawl-policy '{"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"}' \
        --schema-change-policy '{"UpdateBehavior":"LOG","DeleteBehavior":"LOG"}'
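
To confirm that the change took effect, the update can be wrapped together with a read-back of the crawler's policy. The following is a minimal sketch, not a definitive script: the function names get_recrawl_behavior and set_incremental_crawl are made up for illustration, and it assumes the AWS CLI is installed and configured with permission to call glue:GetCrawler and glue:UpdateCrawler.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the AWS CLI is installed and configured.
# Function names are hypothetical helpers, not part of the AWS CLI.

# Print the crawler's current RecrawlBehavior
# (CRAWL_EVERYTHING right after CloudFormation creates it).
get_recrawl_behavior() {
    aws glue get-crawler \
        --name "$1" \
        --query 'Crawler.RecrawlPolicy.RecrawlBehavior' \
        --output text
}

# Switch the crawler to incremental crawls, then print the resulting value.
set_incremental_crawl() {
    aws glue update-crawler \
        --name "$1" \
        --recrawl-policy '{"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"}' \
        --schema-change-policy '{"UpdateBehavior":"LOG","DeleteBehavior":"LOG"}' \
    && get_recrawl_behavior "$1"
}
```

After sourcing the file, running set_incremental_crawl with your crawler's name should print the new behavior if the update succeeded. The schema change policy is set to LOG in the same call because, as far as I know, Glue expects both behaviors to be LOG when a crawler only crawls new folders.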