Tags: apache-spark, apache-spark-sql, apache-spark-dataset

DataFrames and Datasets in Spark


I am new to Spark and have been reading about DataFrames and Datasets. I was trying to understand the difference between them, but I am confused.
I started here and found that the abstraction of RDD happened in the following order.

RDD (Spark 1.0) -> DataFrame (Spark 1.3) -> Dataset (Spark 1.6)

Q.1 The link here says that a DataFrame is an alias for Dataset[Row], i.e. a Dataset of type Row. If DataFrame was the abstraction over RDD that came first, does that mean Dataset already existed from Spark 1.3, or was DataFrame redefined as Dataset[Row] when Spark 1.6 was developed?

Q.2 The link here says:

"A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row."

If DataFrame is actually Dataset[Row], why is DataFrame called untyped? Isn't the type here supposed to be Row [defined here]?

Q.3 Also, if DataFrame is Dataset[Row], why define DataFrame separately? Every operation on a Dataset should also be callable on a DataFrame. If the statement above is not true or only partly true, please feel free to answer anyway.

If these questions feel too broad, please let me know and I will edit them as required.


Solution

    1. In short, continuity of the external API (from Shark, now removed, through SchemaRDD and DataFrame to Dataset[Row]) doesn't mean internal continuity. The internal API went through significant changes, and the current implementation doesn't even resemble the initial Spark SQL attempts.

    There was no such thing as Dataset in 1.3, nor was DataFrame unified with Dataset until 2.0.
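    Since Spark 2.0 the alias is declared directly in the org.apache.spark.sql package object as type DataFrame = Dataset[Row]. A minimal sketch (assuming a local Spark 2.x+ session; the object and app names here are illustrative, not from the question) shows that no conversion is involved:

        import org.apache.spark.sql.{Dataset, Row, SparkSession}

        object AliasDemo extends App {
          // Local session for demonstration only.
          val spark = SparkSession.builder().master("local[*]").appName("alias-demo").getOrCreate()
          import spark.implicits._

          val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

          // No conversion needed: since 2.0, DataFrame is the same type as Dataset[Row].
          val ds: Dataset[Row] = df
          ds.show()

          spark.stop()
        }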

    2. That's hardly a precise description (much like the highly informal use of "strong typing"). It refers to two facts:
    • Row being a container (collection-ish) of Any, hence no meaningful static typing is possible. It doesn't mean that it is "untyped" (Any is a valid element of the type hierarchy), but it simply doesn't provide useful information to the compiler.

    • Lack of type checking at the DataFrame DSL level, which, like the previous point, is rather misleading, as the type system's constraints are still fully preserved.

    So fundamentally, DataFrame is "untyped" only relative to some idealized, non-existent system in which the compiler protects against all possible runtime failures. In a more realistic scenario, this point contrasts the chosen implementation (Dataset[Row]) with type-oriented variants like frameless (which do type checking at the DSL level, and in turn are subject to some practical limitations of the JVM as a platform, such as wide data).
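    To make this concrete, here is a minimal sketch (assuming Spark 2.x+; the class and object names are made up): a misspelled field in the typed API is rejected by the compiler, while the equivalent mistake in the DataFrame DSL compiles and only fails at runtime:

        import org.apache.spark.sql.SparkSession

        // Illustrative domain class (top level so Spark can derive an Encoder for it).
        case class User(name: String, age: Int)

        object UntypedDemo extends App {
          val spark = SparkSession.builder().master("local[*]").appName("untyped-demo").getOrCreate()
          import spark.implicits._

          val ds = Seq(User("Ann", 32), User("Bob", 41)).toDS()
          val df = ds.toDF()

          // Typed (functional) API: a misspelled field is a compile-time error.
          // ds.map(_.agee)          // does not compile
          val ages = ds.map(_.age)   // Dataset[Int], verified by the compiler
          ages.show()

          // Untyped DSL: the same mistake compiles and only fails at runtime with an
          // AnalysisException, because column names are resolved against the schema.
          // df.select($"agee")      // compiles, throws when analyzed

          // Row is effectively a container of Any: extracting a value needs a cast
          // (getAs[T]) that the compiler cannot verify.
          val firstAge: Int = df.head().getAs[Int]("age")
          println(firstAge)

          spark.stop()
        }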

    3. "Also, if DataFrame is Dataset[Row], why define DataFrame separately? Every operation on a Dataset should also be callable on a DataFrame."

    That's correct, but it doesn't imply the converse. Dataset[Row] is a specific case of Dataset: it has to provide at least as much as Dataset[_] in general, but it can provide more. And this is exactly the case.

    Additionally, it is "a special case" kept to preserve backward compatibility, especially since the "strongly typed" variant is both much less popular and in general less efficient than specialized DataFrame operations.
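    A short sketch of that asymmetry (again assuming Spark 2.x+; the names are illustrative): every Dataset operation runs on a DataFrame because it is a Dataset[Row], while relational (untyped) operations on a typed Dataset fall back to DataFrame, and only the typed API preserves the element type:

        import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

        // Illustrative domain class (top level so Spark can derive an Encoder for it).
        case class User(name: String, age: Int)

        object SpecialCaseDemo extends App {
          val spark = SparkSession.builder().master("local[*]").appName("special-case").getOrCreate()
          import spark.implicits._

          val ds: Dataset[User] = Seq(User("Ann", 32), User("Bob", 41)).toDS()

          // A DataFrame *is* a Dataset[Row], so every Dataset operation applies to it ...
          val df: DataFrame = ds.toDF()
          val adults: Dataset[Row] = df.filter($"age" > 18) // filter preserves the Row type

          // ... but relational (untyped) operations on a typed Dataset drop back to DataFrame:
          val names: DataFrame = ds.select($"name")

          // while the typed equivalent keeps the element type:
          val namesTyped: Dataset[String] = ds.map(_.name)

          namesTyped.show()
          spark.stop()
        }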