Tags: apache-spark, hadoop, etl, impala

Impala shell or Spark for ETL?


I have recently started working in a Hadoop environment and need to do some basic ETL to populate a few tables. Currently I import data into Hadoop using Sqoop and write the transformation SQL queries in the Impala shell.

But I am hearing about Spark a lot these days. In my situation, would there be any advantage to writing my ETL in Spark instead of the Impala shell?

Thanks S


Solution

  • Many people in the past did ETL with either A) SQL scripts (for example, Impala) driven by UNIX shell scripts, or B) dedicated ETL tools.

    However, the question is really 1) one of scale, in my opinion, and 2) one of standardizing on technologies.

    Since Spark is being used, why not standardize on Spark?

    I have been through this cycle, and Kimball-style DWH processing can be done quite well with Spark. It means lower costs compared with paid ETL tools like Informatica, although community editions of such tools do exist.

    Some points to note:

    • Saving files to the various HDFS formats is easier and more direct with the DataFrameWriter API (see the sketch after this list).
    • But Informatica-like mappings with branching logic work a little differently in Spark.
    • Performance at scale will be better with Spark once the data has been pulled from the external sources.
    • File control is easier with UNIX scripting than inside Spark, in my opinion, but it is just a matter of getting used to doing it within Spark.
    • Sqoop can be obviated by using Spark's JDBC DataFrame reader (see the sketch at the end of this answer), though there is no compelling reason to dispense with Sqoop. I would use Confluent Kafka Connect instead, even at higher latency, but then we get into Zen questions, as Kafka is aimed at more real-time work.
    • I am not convinced overall of the benefits of ETL tools.
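
    A minimal sketch of the DataFrameWriter point above, assuming a hypothetical Hive-registered staging table staging.orders, illustrative HDFS paths, and (for the Avro write) the spark-avro module on the classpath:

        import org.apache.spark.sql.{SaveMode, SparkSession}

        object WriteFormatsSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("hdfs-format-writer-sketch")
              .getOrCreate()

            // Hypothetical staging table already registered in the metastore.
            val staged = spark.table("staging.orders")

            // The same DataFrameWriter API covers the common HDFS formats;
            // the paths and the order_date partition column are illustrative only.
            staged.write.mode(SaveMode.Overwrite)
              .partitionBy("order_date")
              .parquet("/data/warehouse/orders_parquet")

            staged.write.mode(SaveMode.Overwrite)
              .orc("/data/warehouse/orders_orc")

            // Assumes the spark-avro module is on the classpath.
            staged.write.mode(SaveMode.Overwrite)
              .format("avro")
              .save("/data/warehouse/orders_avro")

            spark.stop()
          }
        }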

    With the cost reductions that IT needs to undergo, Spark is a good option. But it is not for the faint-hearted: you need to be a good programmer. That is what I hear many people saying.
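
    To make the Sqoop-versus-JDBC point concrete, here is a sketch that reads a source table through Spark's JDBC DataFrame reader and performs a simple Kimball-style fact load. The connection URL, credentials, table names, partition bounds and the dwh.dim_customer dimension are all assumptions for illustration, not a prescription:

        import org.apache.spark.sql.{SaveMode, SparkSession}

        object JdbcEtlSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("jdbc-etl-sketch")
              .getOrCreate()

            // Pull source rows straight over JDBC instead of a separate Sqoop job.
            // URL, credentials, table and partitioning settings are assumptions.
            val orders = spark.read.format("jdbc")
              .option("url", "jdbc:oracle:thin:@//dbhost:1521/SRC")
              .option("dbtable", "SRC.ORDERS")
              .option("user", "etl_user")
              .option("password", sys.env.getOrElse("SRC_DB_PASSWORD", ""))
              .option("partitionColumn", "ORDER_ID")
              .option("lowerBound", "1")
              .option("upperBound", "10000000")
              .option("numPartitions", "8")
              .load()

            // A simple Kimball-style step: resolve the customer surrogate key
            // against an existing (hypothetical) dimension, then write the fact table.
            val dimCustomer = spark.table("dwh.dim_customer")

            val factOrders = orders
              .join(dimCustomer,
                orders("CUSTOMER_ID") === dimCustomer("customer_natural_key"))
              .select(
                dimCustomer("customer_sk"),
                orders("ORDER_ID").as("order_id"),
                orders("ORDER_DATE").as("order_date"),
                orders("AMOUNT").as("amount"))

            factOrders.write.mode(SaveMode.Append)
              .partitionBy("order_date")
              .parquet("/data/warehouse/fact_orders")

            spark.stop()
          }
        }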