apache-spark, scheduled-tasks, ibm-cloud, spark-notebook

Scheduler for jobs executing Apache Spark SQL on Bluemix


I am using Apache Spark in Bluemix and want to implement a scheduler for Spark SQL jobs. I saw this link to a blog that describes scheduling, but it's not clear how I should update the manifest. Is there some other way to schedule my jobs?


Solution

  • The manifest file guides the deployment of Cloud Foundry (cf) apps. In your case, it sounds like you want to deploy a cf app that acts as a SparkSQL scheduler, and use the manifest to declare that the app needs no web routing (for example, with the `no-route: true` attribute) or anything else meant for user-facing apps, because it only runs a background scheduler. That part is fine, and the cf docs will help you make it happen.

    However, you cannot run a SparkSQL scheduler against the Bluemix Spark service today, because the service only supports Jupyter notebooks through the Data-Analytics section of Bluemix; that is, it offers only a notebook UI. You need a Spark API that your scheduler cf app can drive, such as a spark-submit style interface where you create your Spark context and then run programs like the SparkSQL jobs you mention. Such an API is supposed to be coming to the Apache Spark Bluemix service.

    UPDATE: spark-submit was made available around the end of Q1 2016. It is a shell script, but internally it makes REST calls via curl. The REST API itself does not seem to be supported yet, so you can either call the script from your scheduler, or take the risk of calling the REST API directly and hope it doesn't change and break you.
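
As a rough illustration of the "call the script in your scheduler" option, here is a minimal sketch of a background loop that shells out to the script at a fixed interval. It is only a sketch under assumptions: the script location `./spark-submit.sh`, the job file name `my_sparksql_job.py`, and the helper names `run_job` and `run_every` are hypothetical placeholders, and the real script will need whatever arguments its own documentation specifies.

```python
import subprocess
import sys
import time

# Hypothetical placeholder command: replace the path and arguments with the
# ones the Bluemix spark-submit script actually requires for your job.
SPARK_SUBMIT_CMD = ["./spark-submit.sh", "my_sparksql_job.py"]


def run_job(cmd):
    """Run one job as a child process and return its exit code."""
    result = subprocess.run(cmd)
    return result.returncode


def run_every(interval_seconds, cmd, max_runs=None, sleep=time.sleep):
    """Run cmd repeatedly, pausing interval_seconds between runs.

    max_runs=None loops forever (the normal case for a background
    scheduler); a finite max_runs is handy for trying the loop out.
    Returns the list of exit codes collected so far.
    """
    exit_codes = []
    while max_runs is None or len(exit_codes) < max_runs:
        code = run_job(cmd)
        exit_codes.append(code)
        if code != 0:
            # Report failures but keep the schedule going.
            print("job exited with code", code, file=sys.stderr)
        if max_runs is None or len(exit_codes) < max_runs:
            sleep(interval_seconds)
    return exit_codes
```

Calling `run_every(3600, SPARK_SUBMIT_CMD)` would submit the job roughly once an hour. A fixed-interval loop like this drifts over time and has no retry or overlap handling, so a cron-style library may be a better fit for anything beyond a simple case.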