We can write Spark code/transformations using the RDD (low-level) API, DataFrames, or SQL. As I understand it, DataFrames/SQL are more performant than the low-level RDD API (thanks to Tungsten and the Catalyst optimizer), which is why DataFrames/SQL are the recommended choice.
Internally, however, Spark converts all of this code to RDDs. So even if we write DataFrame code, it is ultimately executed as RDDs. How, then, is using the high-level APIs beneficial?
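For concreteness, here is a minimal sketch of the same filter-and-project written with both APIs (the data and column names are made up, and the exact plan output depends on your Spark version); the point is that the DataFrame version exposes a query plan Spark can optimize before it is compiled down to RDDs, while the RDD version is just opaque closures:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch (local mode, made-up data) comparing the two APIs.
object DfVsRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("df-vs-rdd")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 10.0), (2, 25.0), (3, 40.0)).toDF("id", "amount")

    // DataFrame: Spark sees the full expression tree and optimizes it
    // (Catalyst) before generating code that runs on RDDs underneath.
    val viaDf = df.filter($"amount" > 20).select($"id")
    viaDf.explain(true) // analyzed -> optimized -> physical plan

    // RDD: the closures below are opaque JVM functions; Spark executes
    // them as written and cannot reorder or fuse them for you.
    val viaRdd = df.rdd
      .filter(row => row.getDouble(1) > 20)
      .map(row => row.getInt(0))
    println(viaRdd.toDebugString) // only the RDD lineage, no optimized plan

    spark.stop()
  }
}
```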
There is the Spark optimizer, Catalyst, which applies optimization strategies to DataFrames and Datasets, but not to RDDs. In addition, with an RDD you process a whole row (or tuple, whatever you want to call it), whereas DataFrames and Datasets can be processed by Spark in a columnar fashion.
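A rough illustration of that difference, assuming a hypothetical Parquet file at `/tmp/events.parquet` with `user`, `country`, and `amount` columns (the path and schema are made up for the example):

```scala
import org.apache.spark.sql.SparkSession

object CatalystColumnar {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalyst-columnar")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical Parquet file with columns: user, country, amount.
    val events = spark.read.parquet("/tmp/events.parquet")

    // DataFrame version: Catalyst can prune columns and push the filter
    // down to the Parquet scan, so only `user` and `country` are read
    // and rows stay in Tungsten's compact columnar/binary format.
    events
      .filter($"country" === "DE")
      .select($"user")
      .explain(true) // look for PushedFilters and the pruned ReadSchema

    // RDD version: the lambda is a black box to Spark, so every column of
    // every row is materialized as a JVM object before the filter runs.
    val users = events.rdd
      .filter(row => row.getAs[String]("country") == "DE")
      .map(row => row.getAs[String]("user"))
    println(users.toDebugString)

    spark.stop()
  }
}
```

In the DataFrame case the physical plan (shown by `explain`) reflects what Catalyst could push down or prune; in the RDD case Spark can only run your closures exactly as written.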