I have recently started working in a Hadoop environment. I needed to do some basic ETL to populate a few tables. Currently I am importing data into Hadoop with Sqoop and writing SQL queries for the transformations in the Impala shell.
But I am hearing a lot about Spark these days. In my situation, would I gain any advantage by writing my ETL in Spark instead of the Impala shell?
Thanks S
Many people in the past handled ETL either A) with SQL scripts (e.g., Impala) driven by UNIX shell scripts, or B) with dedicated ETL tools.
However, the question is, in my opinion, 1) more one of scale and 2) one of standardizing on technologies.
Since Spark is already being used, why not standardize on Spark?
I have been through this cycle, and Kimball-style DWH processing can be done quite well with Spark. It also means lower costs compared with paid ETL tools like Informatica, although community editions of those do exist.
Some points to note:
With the cost reductions IT needs to undergo, Spark is a good option. But it is not for the faint-hearted: you need to be a good programmer. That is what I hear many people saying.