Is there a data storage format that allows for appending columns?


Let's assume I have a data set I want to use in Spark that contains details about users, like:

id, name, age
123, john, 23
222, Josh, 50
333, bill, 32

Let's say I generate/find a new fact about those users, 'email'.

id, email
123, [email protected]
222, [email protected]
333, [email protected]

Does a storage format exist that would let me dynamically add my new fact to my old dataset without requiring a full rewrite? Basically, I want to add a column in an append-only fashion.


Solution

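  • Try Apache Kudu. Note that it is a storage manager, not a storage format. You need to be on the Cloudera stack, though, and now that Cloudera has merged with Hortonworks I'm not sure what that means.

    Kudu works well for this: adding a column does not require re-stating (rewriting) the existing data. Rows can also be updated, i.e. the data is mutable, but Kudu is not ACID. You don't need that last property here anyway.

    Otherwise, for Hive / HDFS, use Avro with schema evolution.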
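To make the Avro suggestion concrete, here is a minimal sketch of what schema evolution does when a column is added. It is plain Python that imitates Avro's reader-side schema resolution and does not use the Avro library. The `resolve` helper and the record name `User` are hypothetical, chosen only for illustration, and the nullable `email` field with a `null` default is an assumption about how the new column would be declared.

```python
# Simplified model of Avro schema resolution (illustration only, not the Avro library).

# Schema the existing files were written with.
OLD_SCHEMA = {
    "type": "record",
    "name": "User",  # hypothetical record name
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

# Evolved schema: same fields plus a nullable 'email' with a default,
# so files written with OLD_SCHEMA stay readable.
NEW_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": OLD_SCHEMA["fields"]
    + [{"name": "email", "type": ["null", "string"], "default": None}],
}


def resolve(record, reader_schema):
    """Project a record written with an older schema onto the reader schema."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]          # value present in the old data
        elif "default" in field:
            out[name] = field["default"]      # new column: fall back to its default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return out


# Rows from the question, as they exist in the old files.
old_rows = [
    {"id": 123, "name": "john", "age": 23},
    {"id": 222, "name": "Josh", "age": 50},
    {"id": 333, "name": "bill", "age": 32},
]

# Reading old data with the new schema needs no rewrite of the old files.
resolved = [resolve(row, NEW_SCHEMA) for row in old_rows]
for row in resolved:
    print(row)
```

Note what this does and does not give you: old files stay untouched and remain readable under the new schema, but their rows come back with the default for `email`. Populating `email` for existing users still means either writing new data or joining the `id, email` set by `id` at read time, which is where a mutable store such as Kudu differs.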