Tags: dataframe, apache-spark, apache-spark-sql, rdd, apache-spark-dataset

Which is better among RDD, DataFrame, and Dataset for doing columnar operations on Avro data in Spark?


We have a use case where we need to perform columnar transformations on Avro datasets. Until now we have been running MapReduce jobs, and we now want to explore Spark. I am going through some tutorials and am not sure whether we should use RDDs or DataFrames/Datasets. Since DataFrames are stored in a columnar fashion, are they the right choice given that all my transformations are columnar in nature? Or does it not make much difference, since internally everything is based on RDDs anyway?


Solution

  • From a performance standpoint, the on-disk data format has no bearing on which API you should use to describe the transformations.

    I would advise going with the highest-level API available (DataFrames), and dropping down to RDDs only when an operation you need cannot be expressed any other way. The DataFrame API lets Spark's Catalyst optimizer see the structure of your query, so it can prune unused columns and reorder operations, which RDD code written as opaque functions does not allow.