Tags: dataframe, apache-spark, rdd

When to Use RDD And DataFrame in Spark


From what I have read, an RDD cannot take advantage of the optimizations Spark applies to structured data the way a DataFrame can. Does that mean we should use RDDs when dealing with unstructured data sources, and DataFrames when dealing with structured sources such as a database table? And what about semi-structured data like JSON - which abstraction should we adopt, RDD or DataFrame?


Solution

  • RDD

    RDD is the legacy API and is expected to fade from use. It cannot be optimized the way DFs and DS's can, and it is row-based. It still has a couple of handy features: a) adding an ascending sequence number via zipWithIndex, and b) full control over custom partitioning. Joins are painful, since successive (key, value) pair joins entail a lot of manipulation, and RDD support for saving data "at rest" is limited; you tend to convert to a DF for that. A rough sketch of these points follows below.
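
    A minimal sketch of the two RDD features mentioned above, plus a pair-RDD join, assuming a local SparkSession; the data and class names are hypothetical:

    ```scala
    import org.apache.spark.Partitioner
    import org.apache.spark.sql.SparkSession

    object RddFeaturesSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("rdd-sketch").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // a) ascending sequence number via zipWithIndex: (value, 0L), (value, 1L), ...
        val withIndex = sc.parallelize(Seq("a", "b", "c")).zipWithIndex()

        // b) custom partitioning: you decide which partition each key lands in
        class ModPartitioner(parts: Int) extends Partitioner {
          override def numPartitions: Int = parts
          override def getPartition(key: Any): Int = math.abs(key.hashCode) % parts
        }
        val pairs  = sc.parallelize(Seq((1, "x"), (2, "y"), (3, "z")))
        val custom = pairs.partitionBy(new ModPartitioner(2))

        // joins require shaping both sides into (key, value) pair RDDs first
        val other  = sc.parallelize(Seq((1, 10.0), (3, 30.0)))
        val joined = custom.join(other) // RDD[(Int, (String, Double))]

        joined.collect().foreach(println)
        spark.stop()
      }
    }
    ```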

  • DF, DS

    DF and DS are columnar structures (DS is not available in pyspark, though there is Arrow support) that Catalyst can optimize into better plans. Joins are easier, there is schema inference for JSON and good support for semi-structured data, and the SQL-like interface means people other than data engineers can get in on the act - maybe. DFs also have good read and write support from/to Hadoop formats and JDBC databases; see the sketch below.
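
    A minimal sketch of the DataFrame path for semi-structured JSON: schema inference on read, SQL-style querying, and writing the result out. The file paths, table names, and JDBC URL are placeholders, not real endpoints:

    ```scala
    import org.apache.spark.sql.SparkSession

    object DataFrameJsonSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("df-sketch").master("local[*]").getOrCreate()

        // Spark infers the schema (including nested fields) from the JSON itself
        val df = spark.read.json("/path/to/events.json")
        df.printSchema()

        // SQL-like access, so non-engineers can query it too
        df.createOrReplaceTempView("events")
        val daily = spark.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id")

        // "Data at rest" support: Parquet files, or a JDBC database
        daily.write.mode("overwrite").parquet("/path/to/out/daily_counts")
        // daily.write.format("jdbc")
        //   .option("url", "jdbc:postgresql://host:5432/db")
        //   .option("dbtable", "daily_counts")
        //   .save()

        spark.stop()
      }
    }
    ```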

    DS adds type-safety enforcement, still with some caveats, but you did not ask about that; a brief illustration follows anyway.
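
    A minimal sketch of Dataset type safety (Scala API), assuming a hypothetical Order case class: typed operations against the case class are checked at compile time, while string-based column lookups still fail only at runtime:

    ```scala
    import org.apache.spark.sql.SparkSession

    case class Order(id: Long, amount: Double)

    object DatasetTypeSafetySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("ds-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        val ds = Seq(Order(1L, 9.99), Order(2L, 20.0)).toDS()

        // Typed transformations: ds.map(_.amont) would not compile
        val bigOrderIds = ds.filter(_.amount > 10.0).map(_.id)

        // Untyped column lookups are only checked at runtime:
        // ds.select("amont")  // compiles, but fails when executed

        bigOrderIds.show()
        spark.stop()
      }
    }
    ```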

    You can also consult this blog: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html, though I am not entirely convinced by all the perspectives mentioned there. That is just my opinion, though.