Tags: amazon-web-services, amazon-s3, databricks, spark-structured-streaming

Databricks Auto Loader vs. detection of deleted input source files


While continuously ingesting files from a source S3 bucket, I would like to be able to detect when files are deleted. As far as I can tell, Auto Loader cannot detect files that have been deleted from the source folder, so this case is not supported. I want to confirm that first, and if it is indeed the case, ask about the approach or workaround people use to handle that scenario.


Solution

  • According to the Databricks documentation, no: at this time Auto Loader triggers on object creation only (e.g. ObjectCreated events) and therefore does not detect deleted files; neither do Databricks Workflows, for example via the File Arrival Trigger.

    The ideal solution depends on what you intend to do with the deleted files. A generic workaround, however, is to create your own AWS Lambda function that triggers on s3:ObjectRemoved:* events (you can invoke Lambda functions using S3 Event Notifications). Depending on what you need to do with the deleted file, you may prefer to do the processing entirely in this Lambda function. Or you may implement the Lambda function to simply copy the file to a different location (note that the deleted object's contents can only be copied if bucket versioning is enabled; otherwise the Lambda can record the deletion event instead), which you could then have a Databricks workflow process using either Auto Loader or a File Arrival Trigger. Sketches of such a Lambda handler and of the Auto Loader processing appear after the option list below.

    • (option 1): S3 Event Notification --> Lambda (processing)
      • This is the leanest option, with the fewest moving parts, and the most cost-effective, but it may limit you in terms of processing power and Spark capabilities.
    • (option 2): S3 Event Notification --> Lambda --> copy/record the file in another S3 location --> Databricks job processes it with a File Arrival Trigger
      • This is somewhat over-engineered, since Lambda is already event-driven, but you may still choose it if you need longer processing times or capabilities only available on Databricks.
    • (option 3): S3 Event Notification --> Lambda --> copy/record the file in another S3 location --> Databricks job processes it with Auto Loader
      • This is ideal if you want to batch process your deleted files from the quarantine location and have access to the full capabilities of Databricks and Spark.
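
As a concrete illustration of the Lambda side, here is a minimal sketch of a handler subscribed to the source bucket's s3:ObjectRemoved:* notifications. The QUARANTINE_BUCKET name and the deletions/ prefix are assumptions for the example; since the deleted object itself is already gone, this sketch writes a small JSON audit record per deletion rather than copying file contents.

```python
# Minimal sketch (assumed names): a Lambda handler wired to the source
# bucket's s3:ObjectRemoved:* event notifications. It writes one JSON
# audit record per deleted object to a quarantine bucket, because the
# deleted object's contents are no longer retrievable (unless the source
# bucket has versioning enabled).
import json
import os
import urllib.parse

import boto3

s3 = boto3.client("s3")
QUARANTINE_BUCKET = os.environ.get("QUARANTINE_BUCKET", "my-quarantine-bucket")  # assumed name


def lambda_handler(event, context):
    records = event.get("Records", [])
    for record in records:
        source_bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 event notifications
        deleted_key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        audit = {
            "source_bucket": source_bucket,
            "deleted_key": deleted_key,
            "event_name": record["eventName"],   # e.g. "ObjectRemoved:Delete"
            "event_time": record["eventTime"],
        }
        # One small JSON file per deletion, under an assumed deletions/ prefix
        s3.put_object(
            Bucket=QUARANTINE_BUCKET,
            Key=f"deletions/{source_bucket}/{deleted_key}.json",
            Body=json.dumps(audit).encode("utf-8"),
            ContentType="application/json",
        )
    return {"processed": len(records)}
```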
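
For option 3, a Databricks job could then pick up those audit records incrementally with Auto Loader. This is only a sketch: the S3 paths and the target table name are assumptions, and `spark` is the SparkSession already defined in a Databricks notebook or job. The `availableNow` trigger processes everything pending and then stops, which suits the batch-style processing described above.

```python
# Minimal Auto Loader sketch for option 3 (assumed paths and a hypothetical
# target table): incrementally ingest the JSON deletion records that the
# Lambda function writes to the quarantine location.
deletions = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "s3://my-quarantine-bucket/_schemas/deletions")
        .load("s3://my-quarantine-bucket/deletions/")
)

(
    deletions.writeStream
        .option("checkpointLocation", "s3://my-quarantine-bucket/_checkpoints/deletions")
        .trigger(availableNow=True)                 # process all pending files, then stop
        .toTable("audit.source_file_deletions")     # hypothetical Delta target table
)
```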