Tags: apache-spark, dataset, rdd

When I should use RDD instead of Dataset in Spark?


I know that I should primarily use Spark Datasets, but I'm wondering whether there are situations where I should use RDDs instead of Datasets.


Solution

  • In a typical Spark application you should use the Dataset/DataFrame API. Spark optimizes these structures internally and they provide high-level APIs for manipulating data. However, there are situations where RDDs are handy:

    • When manipulating graphs using GraphX
    • When integrating with third-party libraries that only know how to handle RDDs
    • When you want to use the low-level API for finer control over your workflow (e.g. reduceByKey, aggregateByKey)
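To make the last point concrete, here is a minimal sketch of the per-key control those low-level APIs give you. It uses plain Scala collections (no Spark dependency) to mirror the semantics of `reduceByKey` and `aggregateByKey`; the function names and sample data are illustrative, not part of Spark's API.

```scala
object RddSemanticsSketch {
  // Local stand-in for RDD.reduceByKey: merge all values of a key with one binary op.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(op: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(op) }

  // Local stand-in for RDD.aggregateByKey: the accumulator type U may differ
  // from the value type V, which reduceByKey cannot express.
  def aggregateByKeyLocal[K, V, U](pairs: Seq[(K, V)], zero: U)(
      seqOp: (U, V) => U): Map[K, U] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).foldLeft(zero)(seqOp) }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4))

    // Per-key sums, as reduceByKey(_ + _) would compute them.
    println(reduceByKeyLocal(pairs)(_ + _))

    // Per-key (sum, count) accumulators, as aggregateByKey((0, 0))(...) would
    // build them, then a per-key average derived from the accumulator.
    val sumCount = aggregateByKeyLocal(pairs, (0, 0))(
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1))
    println(sumCount.map { case (k, (s, c)) => k -> s.toDouble / c })
  }
}
```

In real Spark code the same shape applies to an `RDD[(K, V)]`, with the extra detail that `aggregateByKey` also takes a `combOp` to merge accumulators across partitions.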