Are there any advantages or disadvantages to using DataFrames over pair RDDs when performing joins in Spark? In other words, are there any join optimizations that you can only do with pair RDDs and not with DataFrames?
A three-way join of three RDDs has to be written as two chained joins using the key-value (k, v) approach. That is cumbersome, and Spark cannot optimize it: the joins simply run in the order in which you wrote them.
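To make the point about join order concrete, here is a minimal sketch in plain Python. It is not Spark code: `pair_join` is a hypothetical helper that only mimics the output shape of a pair-RDD inner join, `(key, (left_value, right_value))`, and the three small datasets are made up for illustration.

```python
from collections import defaultdict

def pair_join(left, right):
    """Inner join of two lists of (key, value) pairs.

    Hypothetical stand-in for a pair-RDD join: each output element is
    (key, (left_value, right_value)).
    """
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(k, (lv, rv)) for k, lv in left for rv in index.get(k, [])]

# Made-up sample data, all keyed by customer id.
orders = [(1, "o1"), (1, "o2"), (2, "o3"), (3, "o4")]
customers = [(1, "alice"), (2, "bob"), (3, "carol")]
vip = [(2, "gold")]

# Order A: join orders with customers first, then with vip.
step_a = pair_join(orders, customers)   # intermediate result: 4 rows
result_a = pair_join(step_a, vip)

# Order B: join customers with vip first, then orders with that.
step_b = pair_join(customers, vip)      # intermediate result: 1 row
result_b = pair_join(orders, step_b)

# The nesting of the values depends on the order the joins were written in,
# so each variant needs its own unpacking.
flat_a = [(k, o, c, t) for k, ((o, c), t) in result_a]
flat_b = [(k, o, c, t) for k, (o, (c, t)) in result_b]

print(len(step_a), len(step_b))
print(flat_a == flat_b)
```

Both orders produce the same final rows, but order B carries a much smaller intermediate result. With the k, v approach, picking the cheaper order (and untangling the nested tuples) is left to whoever writes the code.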
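With DataFrames, the same join can be expressed as a single query that the Catalyst optimizer is free to rearrange: the join order can be chosen from table statistics (cost-based optimization), and from Spark 3 onward the plan can also be adjusted on the fly at run time (adaptive query execution).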
In short: RDDs are very painful for joins.