apache-spark, rdd

Why does Spark emphasize decoupling schema, memory, and storage?


Throughout the Spark literature, I've repeatedly seen mention of claims like the quote below, as well as of decoupling the schema from storage.

Tools written for HPC environments often fail to decouple the in-memory data models from the lower level storage models.

What is the importance of this decoupling? Is it for microservice-style benefits, or for pluggability?


Solution

  • There's an example right after that quote in the book... But in any case, Spark has little to do with microservices.

    The book's point is that the storage Spark reads from can be spread across many commodity machines. This is enabled by Hadoop-compatible filesystems, whether HDFS, S3, or others. Compared with HPC systems that only understand the local UNIX filesystem layer, Hadoop provides a consistent API over many different types of storage.
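    A minimal sketch of what that consistent API looks like in practice; the paths and hostnames here are placeholders, not from the book:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("storage-decoupling-sketch")
      .getOrCreate()

    // The read call is identical regardless of where the bytes live;
    // only the URI scheme changes.
    val fromLocal = spark.read.textFile("file:///tmp/events.txt")
    val fromHdfs  = spark.read.textFile("hdfs://namenode:8020/data/events.txt")
    val fromS3    = spark.read.textFile("s3a://my-bucket/data/events.txt")
    ```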

    For in-memory storage, Spark has pluggable serializers, for example the default Java serializer or Kryo.
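    Swapping the serializer is a configuration change rather than a code rewrite. In this sketch the config key and serializer class are real Spark values, while `Event` is a hypothetical application class:

    ```scala
    import org.apache.spark.SparkConf

    case class Event(id: Long, payload: String) // hypothetical application class

    val conf = new SparkConf()
      .setAppName("pluggable-serializer-sketch")
      // Swap the in-memory/shuffle serializer without touching job logic.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes lets Kryo write compact IDs instead of full class names.
      .registerKryoClasses(Array(classOf[Event]))
    ```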

    For on-disk storage, Spark (via the Hadoop ecosystem) can store data in formats such as Parquet and Avro that carry a self-describing schema, which is read on request rather than predefined and stored externally, for example in a database.
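    A short sketch of that round trip with Parquet (the path is a placeholder): no schema is declared anywhere outside the file itself, yet the read side recovers it on request:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("self-describing-schema-sketch")
      .getOrCreate()
    import spark.implicits._

    // The schema travels inside the Parquet file itself.
    Seq((1, "alice"), (2, "bob")).toDF("id", "name")
      .write.mode("overwrite").parquet("/tmp/users.parquet")

    // No external schema registry is consulted; Spark reads it from the file.
    spark.read.parquet("/tmp/users.parquet").printSchema()
    // prints: id (integer), name (string)
    ```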

    Compared with the HPC systems the book talks about, other Hadoop-ecosystem tools can read the same files that Spark can, so you're not tied to a proprietary format that works only inside one HPC environment. That is the decoupling.
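    To make that concrete, here is a sketch of sharing those files with other engines, assuming a Hive metastore is available; the table name and path are placeholders:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("format-interop-sketch")
      .enableHiveSupport() // assumes a reachable Hive metastore
      .getOrCreate()

    // Register the Parquet files as a table over an existing location. Any
    // engine that speaks Parquet (Hive, Trino, Impala, ...) can query the
    // same bytes; nothing about the on-disk layout is private to Spark.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS users
      USING parquet
      LOCATION '/tmp/users.parquet'
    """)
    ```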