apache-spark spark-streaming

How to find which files were processed in Spark file streaming


I have a Structured Streaming app set up that monitors a folder in blob storage for new files and processes them. It works well: I can monitor cluster health, see the incoming records, output records, etc. But I would really like to know whether there is any log that tells me which file was processed, or how many records from a given file were processed.

Any pointers would be helpful.


Solution

  • The names of the files that were processed are recorded in the stream's configured checkpoint, i.e. the location set with .option("checkpointLocation", "dbfs://checkpointPath"); a sketch of how to inspect it follows this list.

    For monitoring how many input rows were actually processed by the stream, look into StreamingQueryListener; see the second sketch below.
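
A minimal sketch of inspecting the checkpoint's file-source log, assuming a single file source (source index 0) and reusing the placeholder checkpoint path from above. The file source keeps one log file per micro-batch under sources/0/, each starting with a version header line followed by one JSON entry per input file picked up in that batch; this is an internal layout, so treat it as read-only diagnostics rather than a stable API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// Path to the file source's per-batch log inside the checkpoint directory.
// "dbfs://checkpointPath" is the placeholder path from the answer; "0" is
// the source index and assumes the query reads from a single file source.
val sourceLog = "dbfs://checkpointPath/sources/0"

// Each log file starts with a version line (e.g. "v1") followed by JSON
// entries; keep only the JSON lines, which contain the processed file paths.
spark.read
  .text(sourceLog)
  .filter(col("value").startsWith("{"))
  .show(truncate = false)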
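
And a minimal sketch of a StreamingQueryListener that logs per-batch row counts. The numInputRows field and the per-source breakdown come from the standard StreamingQueryProgress object; the ProgressLogger class name and println-based logging are just illustrative choices:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Logs row counts for every completed micro-batch of every streaming query
// registered on this SparkSession.
class ProgressLogger extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: id=${event.id} name=${event.name}")

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // Total rows ingested in this micro-batch, plus a per-source breakdown.
    println(s"Batch ${p.batchId}: ${p.numInputRows} input rows")
    p.sources.foreach { s =>
      println(s"  source=${s.description} rows=${s.numInputRows}")
    }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: id=${event.id}")
}

// Register the listener on the active SparkSession before starting the query.
val spark = SparkSession.builder().getOrCreate()
spark.streams.addListener(new ProgressLogger)
```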