Tags: dataframe, apache-spark, apache-spark-sql, rdd, apache-spark-dataset

Which is better among RDD, DataFrame, and Dataset for doing columnar operations on Avro data in Spark?


We have a use case where we need to perform columnar transformations on Avro datasets. Until now we have been running MapReduce jobs, and we now want to explore Spark. I am going through some tutorials and am not sure whether we should use RDDs or DataFrames/Datasets. Since DataFrames are stored in a columnar fashion, are they the right choice given that all my transformations are columnar in nature? Or does it not make much difference, since internally everything is based on RDDs anyway?


Solution

  • From a performance standpoint, the on-disk data format has no bearing on which API you should use to describe the transformations.

    I would advise going with the highest-level API available (DataFrames), and dropping down to RDDs only when an operation you need cannot be expressed any other way. The DataFrame API lets Spark's Catalyst optimizer see the structure of your query, so it can prune unused columns and reorder operations, which RDD code written as opaque functions does not allow.