Tags: java, apache-spark, apache-spark-sql, spark-thriftserver

What is the best way to query data stored in HDFS using Spark?


I would like to create a Java app that queries data in HDFS using Spark. So far, I've tested doing this in two ways:

- making SQL queries against the JDBC endpoint exposed by the Thrift server (started with the default configuration)
- using the Spark Dataset API
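For reference, the JDBC route looks roughly like this. This is a minimal sketch, assuming the Thrift server is running on its default address `localhost:10000`, the `hive-jdbc` driver is on the classpath, and a hypothetical table `events` is already registered in the metastore:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThriftServerQuery {
    public static void main(String[] args) throws Exception {
        // The Thrift server speaks the HiveServer2 protocol, so the Hive
        // JDBC driver is used. Older driver versions need explicit loading:
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // localhost:10000 is the Thrift server's default listening address
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             // "events" is a hypothetical table registered in the metastore
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM events")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}
```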

My question is, being completely new to Hadoop/Spark, which of the two approaches is more efficient and easier to set up (beyond the default configurations)?

From what I understand so far, using the Thrift server requires configuring and maintaining both Thrift and Hive. On the other hand, I expect the Dataset API to be slower and more limited, since it keeps the data in memory.
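For comparison, the Dataset API route I tested looks roughly like this (the Parquet path `hdfs:///data/events` is a hypothetical stand-in for my actual data):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DatasetApiQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hdfs-query")
                .getOrCreate();

        // Hypothetical Parquet data living in HDFS
        Dataset<Row> events = spark.read()
                .parquet("hdfs:///data/events");

        // Either register a temporary view and use SQL...
        events.createOrReplaceTempView("events");
        spark.sql("SELECT COUNT(*) FROM events").show();

        // ...or express the same query with the untyped Dataset API
        System.out.println(events.count());

        spark.stop();
    }
}
```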


Solution

  • The Thrift server does require a bit more configuration, and it needs a Hive metastore to keep table definitions, but in exchange you're able to query everything using SQL. At the end of the day, the performance of a Thrift server query and of a query using the untyped Dataset API is basically the same, while the Dataset API gives you more functional flexibility. The strongly typed Dataset APIs are less performant than the untyped Dataset API because the code generation emits inefficient code (especially pre-Spark 2.2).
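To illustrate the typed vs. untyped distinction, here is a minimal sketch with a hypothetical `Event` bean and data path. The untyped filter is a Column expression that Catalyst can see and optimize, while the typed filter deserializes every row into an `Event` and runs a lambda the optimizer cannot inspect; that deserialization and the code generated around it is where the extra cost comes from:

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class TypedVsUntyped {
    // Hypothetical bean matching the data's schema
    public static class Event implements java.io.Serializable {
        private String status;
        public String getStatus() { return status; }
        public void setStatus(String status) { this.status = status; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("typed-vs-untyped")
                .getOrCreate();
        // Hypothetical path, as in the earlier sketch
        Dataset<Row> events = spark.read().parquet("hdfs:///data/events");

        // Untyped: the predicate is a Column expression, fully visible to
        // Catalyst, so it can be pushed down and code-generated efficiently.
        Dataset<Row> untyped = events.filter(col("status").equalTo("ok"));

        // Typed: each Row is deserialized into an Event before the lambda
        // runs, and the lambda itself is opaque to the optimizer.
        Dataset<Event> typed = events.as(Encoders.bean(Event.class))
                .filter((FilterFunction<Event>) e -> "ok".equals(e.getStatus()));

        System.out.println(untyped.count() + " / " + typed.count());
        spark.stop();
    }
}
```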