apache-spark, spark-streaming

Run a Spark command automatically


I have an object in Spark Scala that reads an HDFS file and exports it to a local file within my cluster. Inside the object I created a SparkSession, and the function correctly returns what I want when I run the following command:

ReadFiles.main(Array("hdfs://.../info.log"))
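For context, a minimal sketch of what such an object might look like (the output path and write logic here are just placeholders, not my actual code):

    import org.apache.spark.sql.SparkSession

    object ReadFiles {
      def main(args: Array[String]): Unit = {
        // getOrCreate reuses an existing SparkSession if one is already running
        val spark = SparkSession.builder()
          .appName("ReadFiles")
          .getOrCreate()

        // Read the HDFS file passed as the first argument
        val df = spark.read.text(args(0))

        // Export it to a local path (placeholder location)
        df.coalesce(1)
          .write
          .mode("overwrite")
          .text("file:///tmp/info_export")
      }
    }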

But I want this function to run every 5 minutes. Is there a way to execute the command every 5 minutes? Or is there some setting in the SparkSession that does this?

Thanks


Solution

  • You can do this with a scheduled executor thread, as below.

    import java.util.concurrent.Executors
    import java.util.concurrent.TimeUnit.SECONDS

    // Wrap the existing job in a Runnable so the executor can schedule it
    def fileReaderThread() = new Runnable {
      override def run(): Unit = {
        ReadFiles.main(Array("hdfs://.../info.log"))
      }
    }

    // Start immediately (initial delay 0), then wait 300 seconds (5 minutes)
    // after each run finishes before starting the next one
    Executors.newSingleThreadScheduledExecutor
      .scheduleWithFixedDelay(fileReaderThread(), 0L, 300L, SECONDS)


    Call newSingleThreadScheduledExecutor only once, from a separate main. From then on it will keep invoking your read-files method at the fixed interval (scheduleWithFixedDelay waits 300 seconds after each run completes before starting the next).
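
    For example, a minimal driver could look like this (the wrapper object name is just a placeholder):

    import java.util.concurrent.Executors
    import java.util.concurrent.TimeUnit.SECONDS

    // Hypothetical wrapper: sets up the scheduler once in its own main
    object ReadFilesScheduler {
      def main(args: Array[String]): Unit = {
        val scheduler = Executors.newSingleThreadScheduledExecutor

        scheduler.scheduleWithFixedDelay(new Runnable {
          override def run(): Unit = ReadFiles.main(Array("hdfs://.../info.log"))
        }, 0L, 300L, SECONDS)

        // The executor's thread is non-daemon, so the JVM stays alive and the
        // task keeps running every 5 minutes until the process is stopped.
      }
    }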