Let's assume I have a dataset I want to use in Spark that contains details about users, like:
id, name, age
123, john, 23
222, Josh, 50
333, bill, 32
Let's say I generate/find a new fact about those users, 'email'.
id, email
123, john@gmail.com
222, Josh@gmail.com
333, bill@gmail.com
Does a storage format exist that would let me dynamically add my new fact to my old dataset without requiring a full rewrite? Basically adding an append-only column?
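Try Apache Kudu. It is a storage manager, not a storage format. You need to be on the Cloudera stack, though; now that Cloudera has merged with Hortonworks, I'm not sure what that means going forward.
Kudu works well for this, i.e. no restating (rewriting) of the existing data is required. Updates are possible, i.e. the data is mutable, but it is not ACID. You don't need that last aspect here, though.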
Otherwise, for Hive on HDFS, look at schema evolution with Avro.
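To make the Avro option a bit more concrete, here is a minimal sketch of the idea behind schema evolution, in plain Python rather than with the Avro library. The record name `User`, the nullable type chosen for `email`, and the helper `resolve` are assumptions for illustration only; `resolve` merely imitates the rule that a field missing from old data is filled with the reader schema's default.

```python
# Old (writer) schema: what the existing files were written with.
writer_schema = {
    "type": "record",
    "name": "User",  # hypothetical record name
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

# New (reader) schema: same fields plus a nullable "email" with a default.
# In a union, the default has to match the first branch, hence "null" first.
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": writer_schema["fields"] + [
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record, reader):
    """Hypothetical helper: fill fields absent from old data with the reader's default."""
    out = {}
    for field in reader["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return out


# Rows as they exist in the old files (taken from the question).
old_rows = [
    {"id": 123, "name": "john", "age": 23},
    {"id": 222, "name": "Josh", "age": 50},
    {"id": 333, "name": "bill", "age": 32},
]

# Reading old rows through the new schema needs no rewrite of the old files.
resolved = [resolve(row, reader_schema) for row in old_rows]
for row in resolved:
    print(row)
```

Note what this does and does not give you: the old files stay readable under the new schema, but the existing rows come back with `email` set to null. Attaching the actual emails to ids 123, 222 and 333 still takes a join against the new data or an update in a mutable store such as Kudu.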