
Databricks Auto Loader - why is new data not written to the table when the original CSV file is deleted and a new CSV file is uploaded?


I have a question about Auto Loader's writeStream.

I have the following use case:

  1. A few days ago I uploaded 2 CSV files into the Databricks file system, then read them and wrote them to a table with Auto Loader.
  2. Today I found that those files contained wrong (faked) data, so I deleted the old CSV files and uploaded 2 new, correct CSV files.
  3. Then I read and wrote the new files with Auto Loader streaming.

I found that the stream reads the data from the new files successfully, but the writeStream does not write it to the table.

Then I deleted the checkpoint folder with all its subfolders and files, re-created the checkpoint folder, and ran the read and write stream again. This time the data was written to the table successfully.

Question: since Auto Loader has detected the new files, why can't it write them to the table successfully until I delete the checkpoint folder and create a new one?


Solution

  • Auto Loader works best when new files are added to a directory; overwriting existing files can give unexpected results. I haven't worked with the option cloudFiles.allowOverwrites set to True yet, but it might help you here (see the documentation below and the sketch that follows).
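
    A minimal sketch of what that could look like, assuming CSV input and placeholder paths (cloudFiles.allowOverwrites is the real Auto Loader option; the format and paths here are illustrative):

    df = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "csv")                           # source file format
          .option("cloudFiles.allowOverwrites", "true")                 # pick files up again when they are overwritten
          .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
          .load("<source_directory>"))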

    On the question of readStream detecting the overwritten files while writeStream does not: this is because of the checkpoint. The checkpoint is always tied to the writeStream operation. If you do

    df = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "csv")                           # source file format (required by Auto Loader)
          .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
          .load("filepath"))
    display(df)
    

    then you will always see the data from all the files in the directory. If you use writeStream, you need to add .option("checkpointLocation", "<path_to_checkpoint>"). This checkpoint remembers that the (overwritten) files have already been processed, which is why the overwritten files are only processed again after you delete the checkpoint.
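
    For completeness, a minimal sketch of the corresponding writeStream with its checkpoint, assuming a Delta target and placeholder names (the table name and paths are illustrative; the availableNow trigger needs a recent Spark/Databricks runtime):

    (df.writeStream
       .format("delta")
       .option("checkpointLocation", "<path_to_checkpoint>")  # records which source files were already processed
       .trigger(availableNow=True)                            # process all currently available files, then stop
       .toTable("<target_table>"))

    Deleting that checkpoint folder wipes this record, which is why the stream only reprocessed the files after you removed it.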

    Here is some more documentation about the topic: