Tags: java, scala, apache-spark, jvm, hdfs

What are the differences between Dataframe, Dataset, and RDD in Apache Spark?


In Apache Spark, what are the differences between these APIs? Why and when should we choose one over the others?


Solution

  • First, let's define what Spark does

    • Simply put, it executes operations on distributed data. Thus, the operations also need to be distributed. Some operations are simple, such as a filter that drops all items that don't respect some rule. Others are more complex, such as groupBy, which needs to move data around, and join, which needs to associate items from 2 or more datasets. (See the sketch after this list.)

    • Another important fact is that input and output are stored in different formats, and Spark has connectors to read and write them. But that means serializing and deserializing the data. While transparent, serialization is often the most expensive operation.

    • Finally, Spark tries to keep data in memory for processing, but it will [ser/deser]ialize data locally on each worker when it doesn't fit in memory. Once again, this is done transparently, but it can be costly. Interesting fact: estimating the data size can take time.
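As a rough illustration of the points above, here is a minimal sketch of such a pipeline in Scala. The paths, the Sale case class, and the column names are made up for the example, not taken from the question:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for the example
case class Sale(shop: String, amount: Double)

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pipeline-sketch").getOrCreate()
    import spark.implicits._

    // Reading: a connector deserializes the stored format (here parquet)
    val sales = spark.read.parquet("/data/sales").as[Sale]

    // "Simple" operation: each worker drops rows locally, no data movement
    val bigSales = sales.filter(_.amount > 100.0)

    // "Complex" operation: rows with the same key must be shuffled to the
    // same worker, which implies ser/deser and network transfer
    val totals = bigSales
      .groupByKey(_.shop)
      .mapValues(_.amount)
      .reduceGroups(_ + _)

    // Writing: the result is serialized back to an output format
    totals.toDF("shop", "total").write.parquet("/data/shop_totals")

    spark.stop()
  }
}
```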

    The APIs

    • RDD

    It's the first API provided by Spark. To put it simply, it is an unordered sequence of Scala/Java objects distributed over a cluster. All operations executed on it are JVM methods (passed to map, flatMap, groupBy, ...) that need to be serialized, sent to all workers, and applied to the JVM objects there. This is pretty much the same as using a Scala Seq, but distributed. It is strongly typed, meaning that "if it compiles then it works" (if you don't cheat). However, lots of distribution issues can arise, especially if Spark doesn't know how to [de]serialize the JVM classes and methods. (The three APIs are contrasted in a short sketch after the notes below.)

    • Dataframe

    It came later and is semantically very different from RDD. The data is considered as tables, and operations such as SQL operations can be applied to it. It is not typed at all, so errors can arise at any time during execution. However, there are, I think, 2 pros: (1) many people are used to table/SQL semantics and operations, and (2) Spark doesn't need to deserialize the whole line to process one of its columns, if the data format provides suitable column access. And many formats do, such as parquet, which is the most commonly used.

    • Dataset

    It is an improvement over Dataframe that brings some type-safety. Datasets are dataframes to which we associate an "encoder" related to a JVM class, so Spark can check that the data schema is correct before executing the code. Note, however, that Datasets are sometimes described as strongly typed, but they are not: they bring some type safety, in that you cannot compile code that uses a Dataset with a type other than the one that was declared. But it is very easy to write code that compiles yet still fails at runtime. This is because many Dataset operations lose the type (pretty much everything apart from filter). Still, it is a huge improvement, because even when we make a mistake, it fails fast: the failure happens when interpreting the Spark DAG (i.e. at start), instead of during data processing.

    Note: Dataframes are now simply untyped Datasets (Dataset<Row>)

    Note 2: Dataset provides the main API of RDD, such as map and flatMap. From what I know, it is a shortcut: convert to RDD, apply map/flatMap, then convert back to a Dataset. It's practical, but it also hides the conversion, making it difficult to realize that a possibly costly ser/deser-ialization happened.
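To make the contrast concrete, here is a small sketch that applies the same filter with each of the three APIs. The Person case class and the sample values are made up for the example:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type for the example
case class Person(name: String, age: Int)

object ApiContrast {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("api-contrast").getOrCreate()
    import spark.implicits._

    // RDD: plain JVM objects, strongly typed, but every operation is an
    // opaque JVM function that Spark must serialize and ship to the workers
    val rdd = spark.sparkContext.parallelize(Seq(Person("a", 30), Person("b", 17)))
    val adultsRdd = rdd.filter(_.age >= 18)              // checked at compile time

    // Dataframe: table/SQL semantics, untyped. A typo in a column name
    // ($"agee") still compiles and only fails at runtime
    val df: DataFrame = rdd.toDF()
    val adultsDf: DataFrame = df.filter($"age" >= 18)

    // Dataset: a Dataframe plus an Encoder for Person. The schema is checked
    // when the plan is built, and lambdas keep the JVM type
    val ds: Dataset[Person] = df.as[Person]
    val adultsDs: Dataset[Person] = ds.filter(_.age >= 18)

    // ... but many operations drop back to the untyped Dataset[Row] (= Dataframe)
    val untypedAgain: DataFrame = adultsDs.select($"name")

    spark.stop()
  }
}
```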

    Pros and cons

    • Dataset:

      • pros: has optimized operations over column-oriented storage
      • pros: many operations don't need deserialization
      • pros: provides table/SQL semantics if you like them (I don't ;)
      • pros: Dataset operations come with an optimization engine, "catalyst", that improves the performance of your code. I'm not sure, however, if it is really that great. If you know what you code, i.e. what is done to the data, your code should be optimized by itself.
      • cons: most operations lose typing (see the sketch after this list)
      • cons: Dataset operations can become too complicated for complex algorithms that don't suit them. The 2 main limits I know of are managing invalid data and complex math algorithms.
    • Dataframe:

      • pros: required as the intermediate step between Dataset operations that lose the type
      • cons: just use Dataset; it has all the advantages and more
    • RDD:

      • pros: (really) strongly typed
      • pros: Scala/Java semantics. You can design your code pretty much as you would for a single-JVM app that processes in-memory collections. Well, with functional semantics :)
      • cons: full JVM deserialization is required to process the data, at any of the steps mentioned before: after reading the input, and between all processing steps that require data to be moved between workers or stored locally to manage memory bounds.
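Here is a sketch of the "lose the type, then get it back" round trip mentioned in the pros/cons above. The Order and ClientTotal case classes are made up for the example:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical record types for the example
case class Order(client: String, amount: Double)
case class ClientTotal(client: String, total: Double)

object TypedRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("typed-round-trip").getOrCreate()
    import spark.implicits._

    val orders: Dataset[Order] =
      Seq(Order("acme", 10.0), Order("acme", 5.0), Order("zorg", 7.0)).toDS()

    // groupBy + agg is a relational (untyped) operation: whatever type we
    // started from, the result is a plain Dataframe
    val untyped: DataFrame = orders
      .groupBy($"client")
      .agg(sum($"amount").as("total"))

    // Going back to a typed Dataset: the schema is checked against ClientTotal
    // when the plan is built, so a mismatch fails fast, before data is processed
    val totals: Dataset[ClientTotal] = untyped.as[ClientTotal]

    totals.show()
    spark.stop()
  }
}
```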

    Conclusion

    • Just use Dataset by default:

      • read the input with an Encoder; if the data format allows it, the input schema will be validated at start
      • use Dataset operations, and when you lose the type, go back to a typed Dataset. Typically, use typed Datasets as input and output of all methods.
    • There are cases where what you want to code would be too complex to express using Dataset operations. Most apps don't need this, but it often happens in my work, where I implement complex mathematical models. In this case:

      • start with a Dataset
      • filter and shuffle (groupBy, join) the data as much as possible with Dataset ops
      • once you have only the required data and don't need to move it anymore, convert to RDD and apply your complex computation, as sketched below.
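A minimal sketch of that last pattern, assuming a made-up Point case class, made-up paths, and a placeholder fitModel function standing in for the "complex computation":

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record types for the example
case class Point(group: String, x: Double, y: Double)
case class Model(group: String, value: Double)

object ComplexModelSketch {
  // Placeholder for a complex mathematical model that does not map well
  // to Dataset operations
  def fitModel(group: String, points: Seq[Point]): Model =
    Model(group, points.map(_.y).sum / math.max(points.size, 1))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("complex-model").getOrCreate()
    import spark.implicits._

    // 1. Read with an Encoder: the input schema is validated at start
    val points: Dataset[Point] = spark.read.parquet("/data/points").as[Point]

    // 2. Filter and shuffle with Dataset ops: after repartition($"group"),
    //    all points of a group sit on the same worker
    val ready: Dataset[Point] = points.filter(_.x > 0.0).repartition($"group")

    // 3. No more data movement needed: convert to RDD and run the complex
    //    computation with plain Scala code
    val models = ready.rdd.mapPartitions { pts =>
      pts.toSeq.groupBy(_.group).iterator.map { case (g, ps) => fitModel(g, ps) }
    }

    // Back to a typed Dataset for the output
    models.toDS().write.parquet("/data/models")
    spark.stop()
  }
}
```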