I know this question was asked years ago, but I am still wondering about the real purpose of using SparkSQL / HiveContext.
The Spark approach offers a more generic distributed processing model than the built-in MapReduce.
I have read a lot of articles claiming that the MR approach is already dead and that Spark is the best option (I understand that I can implement an MR-style approach through Spark).
I am a little confused about when it is recommended to query data using HiveContext.
Indeed, doesn't running a query from SparkSQL/HiveContext imply running an MR job? Doesn't that bring us back to the original problem? Isn't Tez enough if I don't need to wrap the query result in more complex code?
Am I wrong (I am sure I am :-))?
Indeed, doesn't running a query from SparkSQL/HiveContext imply running an MR job?
It does not. In fact, using HiveContext or SparkSession with "Hive support" doesn't imply any connection to Hive other than using the Hive metastore. This approach is used by many other systems, both ETL solutions and databases.
Finally: