We use MS SQL Server as the primary option for various databases and we run hundreds of stored procedures on a regular basis. Now we are moving to a complete big data stack, using Spark for the batch jobs. But we have already invested enormous effort in creating those stored procedures. Is there a way to reuse the stored procedures on top of Spark? Or is there an easy way to migrate them to Spark instead of rewriting them from scratch?
Or does any framework, such as the Cloudera distribution/Impala, address this requirement?
No, not as far as I can tell. You may be able to keep a very similar logical flow, but you are going to have to invest serious time and effort to convert the T-SQL to Spark. I would recommend going straight to Scala rather than spending time on Python/PySpark.
My rule of thumb for the conversion would be to keep anything that is SQL in the stored procs as SQL in Spark (sqlContext.sql("SELECT x FROM y")), but be aware that Spark DataFrames are immutable, so any UPDATE or DELETE actions will have to be rewritten to produce a new, modified DataFrame.
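As a rough illustration of that pattern, here is a minimal Scala sketch (table name, columns, and paths are made up for the example, and it uses the newer SparkSession API, where spark.sql plays the role of sqlContext.sql): the SELECT logic stays as SQL, while the UPDATE and DELETE steps become transformations that yield new DataFrames.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StoredProcConversionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StoredProcConversionSketch")
      .getOrCreate()

    // Hypothetical source data; in practice this would come from your
    // SQL Server tables, e.g. ingested via JDBC or exported to Parquet.
    val orders = spark.read.parquet("/data/orders")
    orders.createOrReplaceTempView("orders")

    // 1) Anything that is plain SQL in the stored proc can stay SQL in Spark.
    val openOrders = spark.sql(
      "SELECT order_id, status, amount FROM orders WHERE status = 'OPEN'")
    openOrders.show(5)

    // 2) T-SQL: UPDATE orders SET status = 'CLOSED' WHERE amount = 0
    //    DataFrames are immutable, so derive a NEW DataFrame instead.
    val ordersUpdated = orders.withColumn(
      "status",
      when(col("amount") === 0, lit("CLOSED")).otherwise(col("status"))
    )

    // 3) T-SQL: DELETE FROM orders WHERE status = 'CANCELLED'
    //    A filter producing a new DataFrame replaces the delete.
    val ordersAfterDelete = ordersUpdated.filter(col("status") =!= "CANCELLED")

    // Write the result out as a new dataset rather than mutating in place.
    ordersAfterDelete.write.mode("overwrite").parquet("/data/orders_cleaned")

    spark.stop()
  }
}
```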