Tags: hadoop, hive, apache-spark, shark-sql

Can someone explain this: "Spark SQL supports a different use case than Hive"?


I am referring to the following link: Hive Support for Spark

It says:

"Spark SQL supports a different use case than Hive."

I am not sure why that would be the case. Does this mean that, as a Hive user, I cannot use the Spark execution engine through Spark SQL?

Some Questions:

  • Spark SQL uses the Hive query parser, so ideally it should support all of Hive's functionality. Is that correct?
  • Will it use the Hive Metastore?
  • Will Hive use the Spark optimizer, or will it build its own?
  • Will Hive translate MR jobs into Spark jobs, or use some other paradigm?

Solution

  • Spark SQL is intended to allow SQL expressions to be used on top of Spark, alongside its machine learning libraries. It lets you use SQL as one tool among others for building advanced analytic (e.g., ML) applications; the sketch below illustrates the pattern. It is not a drop-in replacement for Hive, which is really best at batch processing/ETL.

    However, there is also work ongoing upstream to allow Spark to serve as a general data processing backend for Hive. That work is what would allow you to take full advantage of Spark for Hive use cases specifically.
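
To make the "SQL as one tool among others" point concrete, here is a minimal sketch of the pattern Spark SQL targets. It assumes a Spark 1.x build compiled with Hive support and a hypothetical Hive table `sales(amount DOUBLE, quantity DOUBLE, year INT)`; the table and column names are illustrative, not from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object SqlPlusMl {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-plus-ml"))
    // HiveContext connects to the Hive metastore, so existing Hive tables
    // are queryable -- this also answers the metastore question above.
    val hive = new HiveContext(sc)

    // Step 1: use SQL to select and filter the data (hypothetical table/columns).
    val rows = hive.sql("SELECT amount, quantity FROM sales WHERE year = 2014")

    // Step 2: hand the same data to MLlib as feature vectors. SQL is just one
    // stage in a larger Spark program, not the whole job.
    val features = rows.map(r => Vectors.dense(r.getDouble(0), r.getDouble(1)))
    val model = KMeans.train(features.cache(), 3, 20) // k = 3 clusters, 20 iterations

    model.clusterCenters.foreach(println)
    sc.stop()
  }
}
```

The design point is that the SQL result is an ordinary distributed dataset, so it composes with the rest of the Spark API. That embedded, mixed-workload use case is what Spark SQL serves, as opposed to Hive's role as a standalone batch warehouse.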
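As for the second point, the upstream effort referred to is the "Hive on Spark" work (HIVE-7292). In that design, Hive keeps its own parser, Metastore, and optimizer, and translates its physical plan into Spark jobs instead of MapReduce jobs. If and when it ships, the intended user-facing change is a single Hive setting rather than any query rewrite; something along these lines (hedged, since this work had not been released at the time of the original post):

```sql
-- Proposed usage once Hive on Spark lands: switch the execution engine,
-- then run existing HiveQL unchanged.
SET hive.execution.engine=spark;

SELECT year, SUM(amount)
FROM sales        -- same hypothetical table as above
GROUP BY year;
```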