apache-spark, pyspark

Performance - RDD vs high-level APIs (DataFrames)


We can write Spark transformations using the low-level RDD API, DataFrames, or SQL. As I understand it, DataFrames/SQL are more performant than the low-level RDD API (thanks to Tungsten and the Catalyst optimizer), which is why DataFrames/SQL are the recommended choice.

Internally, however, Spark converts all of this code to RDDs. So even when we write DataFrame code, it is eventually compiled down to RDD operations. How, then, is using the high-level APIs beneficial?
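For concreteness, here is a minimal sketch of the two styles being compared; the data and column names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(1, "click", 10), (2, "view", 5), (3, "click", 7)],
    ["id", "event", "score"],
)

# DataFrame version: declarative operations that Spark can inspect
# and optimize before execution.
df_result = df.filter(df.event == "click").select("id", "score")

# RDD version of the same logic: opaque Python lambdas that Spark
# must execute as-is, row by row.
rdd_result = (df.rdd
                .filter(lambda r: r.event == "click")
                .map(lambda r: (r.id, r.score)))
```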


Solution

  • There is the Spark optimizer, Catalyst, which applies optimization strategies to DataFrames and Datasets, not to RDDs. In addition, with an RDD you always process a whole row (or tuple, whatever you want to call it), which is not the case with DataFrames and Datasets: Spark can process them in a columnar fashion. In other words, the RDDs that Spark eventually produces from a DataFrame come from an already-optimized plan, not from the code you would have written by hand, as the sketch below illustrates.
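A minimal, self-contained sketch of how to see this difference yourself (the data and column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(1, "click", 10), (2, "view", 5)],
    ["id", "event", "score"],
)

# Catalyst's work is visible in the plan: the filter is pushed down,
# only the referenced columns survive (column pruning), and Tungsten's
# code generation shows up as WholeStageCodegen in the physical plan.
df.filter(df.event == "click").select("id", "score").explain(True)

# An RDD pipeline has no such plan; Spark just runs the opaque Python
# lambdas row by row, deserializing every full Row object.
rdd = df.rdd.filter(lambda r: r.event == "click").map(lambda r: (r.id, r.score))
print(rdd.toDebugString().decode("utf-8"))
```

The `explain(True)` output shows the logical plan being rewritten before execution; the RDD lineage from `toDebugString()` shows only the chain of stages, with no optimization applied.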