hadoop apache-spark bigdata apache-spark-1.5

What is the Oozie equivalent for Spark?

We have very complex pipelines that we need to compose and schedule. I see that the Hadoop ecosystem has Oozie for this. What are the choices for Spark-based jobs when I am running Spark on Mesos or in Standalone mode and don't have a Hadoop cluster?

Solution

Unlike with Hadoop, it is pretty easy to chain things together with Spark, so writing a single Spark Scala script that runs the stages one after another might be enough. My first recommendation is trying that.
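As a rough illustration of what "chaining" could look like, here is a minimal sketch in which each pipeline stage is an ordinary function and the driver composes them. The object name, the two-column CSV layout, and the use of `args(0)`/`args(1)` for input and output paths are all assumptions made for the example, not something from your setup.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical pipeline: names and input layout are assumptions for illustration.
object ChainedPipeline {

  // Stage 1: parse "key,value" lines into pairs, skipping lines with too few fields.
  // Note: toLong throws on a non-numeric value, so real input may need stricter checks.
  def parse(lines: RDD[String]): RDD[(String, Long)] =
    lines.map(_.split(","))
      .filter(_.length >= 2)
      .map(fields => (fields(0), fields(1).toLong))

  // Stage 2: sum the values for each key.
  def aggregate(pairs: RDD[(String, Long)]): RDD[(String, Long)] =
    pairs.reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("chained-pipeline"))
    try {
      // Stages are composed as plain function calls; Spark evaluates them lazily
      // and only runs the job when saveAsTextFile is invoked.
      val totals = aggregate(parse(sc.textFile(args(0))))
      totals.map { case (key, sum) => s"$key,$sum" }.saveAsTextFile(args(1))
    } finally {
      sc.stop()
    }
  }
}
```

Something like this would be packaged as a jar and launched with `spark-submit`, which works the same way on Mesos and Standalone.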

If you would like to keep it SQL-like, you can try Spark SQL.
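A SQL-style pipeline could be sketched along these lines, with each step registered as a temporary table that the next query reads from. This assumes the Spark 1.5-era `SQLContext` API; the table names, the `day` column, the threshold, and the JSON input are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical SQL pipeline: table and column names are assumptions for illustration.
object SqlPipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-pipeline"))
    val sqlContext = new SQLContext(sc)
    try {
      // Step 1: load the input (assumed to be JSON with a "day" field) as a table.
      sqlContext.read.json(args(0)).registerTempTable("events")

      // Step 2: aggregate, then expose the result as a table for the next step.
      val daily = sqlContext.sql(
        "SELECT day, COUNT(*) AS n FROM events GROUP BY day")
      daily.registerTempTable("daily")

      // Step 3: filter the intermediate table and write the final result as Parquet.
      val busyDays = sqlContext.sql("SELECT day, n FROM daily WHERE n > 1000")
      busyDays.write.parquet(args(1))
    } finally {
      sc.stop()
    }
  }
}
```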

If you have a really complex flow, it is worth looking at Google Cloud Dataflow: https://github.com/GoogleCloudPlatform/DataflowJavaSDK.
