apache-spark, hdfs, streaming, spark-streaming

Can output files be moved during Spark streaming without crashing the Spark job?


I have a Spark Structured Streaming job running with Kafka as the source, writing ORC files in append mode. While the job is running, I want to move the output files to an HDFS location at regular intervals. Will moving the files ever cause the Spark job to crash or produce bad output? Once Spark has written a file, will it ever look at that file again for any reason? I want to move the files without disrupting Spark in any way.


Solution

  • Because you are only appending data, moving the files won't affect your Structured Streaming job, as long as the _spark_metadata directory (which is generated in your output folder) and the checkpoint directory remain in sync.
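
To illustrate the point above, here is a minimal sketch of a periodic mover that relocates only the data files and never touches _spark_metadata. It is not from the original answer and makes several assumptions: the function name `move_finished_files` and the `min_age_seconds` threshold are hypothetical, the paths are treated as ordinary filesystem paths (on a real cluster you would do the equivalent with `hdfs dfs -mv` or a Hadoop client), the output is a flat, unpartitioned directory of `.orc` files, and a file whose modification time is older than the threshold is assumed to be fully written.

```python
import shutil
import time
from pathlib import Path


def move_finished_files(output_dir, dest_dir, min_age_seconds=300.0, now=None):
    """Move .orc data files out of output_dir, leaving Spark's bookkeeping alone.

    Returns the list of destination paths that were moved.
    """
    src = Path(output_dir)
    dst = Path(dest_dir)
    dst.mkdir(parents=True, exist_ok=True)
    now = time.time() if now is None else now

    moved = []
    for path in sorted(src.iterdir()):
        # Skip directories (this includes _spark_metadata) and any hidden or
        # underscore-prefixed entries, which are bookkeeping rather than data.
        if path.is_dir() or path.name.startswith(("_", ".")):
            continue
        # Only ORC data files are candidates for moving.
        if path.suffix != ".orc":
            continue
        # Assumption: a file untouched for min_age_seconds is no longer
        # being written by the streaming job.
        if now - path.stat().st_mtime < min_age_seconds:
            continue
        target = dst / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

You could call something like `move_finished_files("/data/stream_out", "/data/archive")` from a scheduler; both paths here are placeholders. The age check is only a precaution, so pick a threshold comfortably longer than your trigger interval.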