apache-spark, hive, tez

What is the purpose of Spark SQL over Hive?


I know this question was asked years ago, but I am still wondering about the true purpose of using SparkSQL / HiveContext.

Spark offers a more generic distributed approach than the built-in MapReduce.

I have read a lot of articles claiming that the MR approach is already dead and that Spark is the best (I understand that I can implement an MR approach with Spark).

However, I get a little confused when querying data through HiveContext is recommended.

Doesn't running a query from SparkSQL/HiveContext imply running an MR job? Doesn't that bring us back to the original problem? And isn't Tez enough if I don't need to embed the query result in more complex code?

Am I wrong (I am sure I am :-))?


Solution

  • Indeed, doesn't running a query from SparkSQL/HiveContext imply running an MR job?

    It does not. In fact, using HiveContext or a SparkSession with "Hive support" doesn't imply any connection to Hive other than using the Hive metastore. Spark plans and executes the query with its own engine. This approach is used by many other systems, both ETL solutions and databases.

    Finally:

    • Hive is a database with modular components. It supports a relatively rich permissions system, mutations, and transactions.
    • Spark is a general-purpose processing engine. Despite having a SQL-ish component, it doesn't attempt to be a database.