apache-spark, pyspark

Performance - RDD vs high-level APIs (DataFrames)


We can write Spark transformations using the low-level RDD API, DataFrames, or SQL. As I understand it, DataFrames/SQL are more performant than the low-level RDD API (thanks to Tungsten and the Catalyst optimizer), which is why DataFrames/SQL are the recommended choice.

Internally, however, Spark converts all of this code to RDDs. So even when we write DataFrame code, it is eventually compiled down to RDD operations. How, then, is using the high-level APIs beneficial?
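For concreteness, here is a minimal sketch of the two styles being compared; the data and column names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(1, "click", 10), (2, "view", 5), (3, "click", 7)],
    ["id", "event", "score"],
)

# DataFrame version: declarative operations that Spark can inspect
# and optimize before execution.
df_result = df.filter(df.event == "click").select("id", "score")

# RDD version of the same logic: opaque Python lambdas that Spark
# must execute as-is, row by row.
rdd_result = (df.rdd
                .filter(lambda r: r.event == "click")
                .map(lambda r: (r.id, r.score)))
```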


Solution

  • There is the Spark optimizer, Catalyst, which applies optimization strategies to DataFrames and Datasets, not to RDDs. In addition, with an RDD you always process a whole row (or tuple, whatever you want to call it), which is not the case with DataFrames and Datasets: Spark can process them in a columnar fashion. In other words, the RDDs that Spark eventually produces from a DataFrame come from an already-optimized plan, not from the code you would have written by hand, as the sketch below illustrates.
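A minimal, self-contained sketch of how to see this difference yourself (the data and column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(1, "click", 10), (2, "view", 5)],
    ["id", "event", "score"],
)

# Catalyst's work is visible in the plan: the filter is pushed down,
# only the referenced columns survive (column pruning), and Tungsten's
# code generation shows up as WholeStageCodegen in the physical plan.
df.filter(df.event == "click").select("id", "score").explain(True)

# An RDD pipeline has no such plan; Spark just runs the opaque Python
# lambdas row by row, deserializing every full Row object.
rdd = df.rdd.filter(lambda r: r.event == "click").map(lambda r: (r.id, r.score))
print(rdd.toDebugString().decode("utf-8"))
```

The `explain(True)` output shows the logical plan being rewritten before execution; the RDD lineage from `toDebugString()` shows only the chain of stages, with no optimization applied.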