hadoop, avro

One flat file, different schema depending on a value


I have one flat file in which the number of columns depends on the value of the first field.

For example:

A,0,00,01,Alex

B,2,h

A,2,22,02,Paul

C,99

Here the first field is the record type: A has 4 fields (id, number, rank, name), B has 2 fields (weight, height), and similarly for C.

What is the best way to store this data, Hive or HBase? I need to query it for analytics, so please also let me know the best method for doing that.

Also, can an Avro schema be chosen depending on the value of the first field? Please help.


Solution

  • With a single file, Hive can't query rows whose schemas differ from one row to the next.

    The best you could do with Hive is define a table with as many columns as the widest record type; for narrower rows the remaining columns will be NULL. It would work, but the query results wouldn't look clean.

    I'm not familiar with HBase, sorry.

    As for Avro, one Avro file can have only one schema. So, as with Hive, you would need to define every field and give default values for the fields a row doesn't have.

    Personally, I would use Pig or Spark to filter the rows by label, write each label to its own files, and then create Hive (or possibly HBase) tables over them. That assumes you actually need a persistent query layer rather than simply processing the data in Spark straight from the raw files.

    You can also expose the Spark Thrift Server for interactive queries.
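
The split-by-label idea can be sketched as follows. This is only an illustration in plain Python rather than Pig or Spark, so that it runs without a cluster; the same logic would map to one filter per label in Spark. The names `SCHEMAS`, `split_by_label` and `to_csv` are made up for this example, and the column name `value` for C is a placeholder, since the question doesn't say what C holds.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names per label, taken from the question's description.
# The question does not name C's field, so "value" is a placeholder.
SCHEMAS = {
    "A": ["id", "number", "rank", "name"],
    "B": ["weight", "height"],
    "C": ["value"],
}


def split_by_label(lines):
    """Group rows by their first field and drop that field from each row."""
    groups = defaultdict(list)
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        label, fields = row[0], row[1:]
        expected = SCHEMAS.get(label)
        # Reject unknown labels and rows with the wrong number of fields.
        if expected is None or len(fields) != len(expected):
            raise ValueError(f"unexpected row: {row!r}")
        groups[label].append(fields)
    return dict(groups)


def to_csv(label, rows):
    """Render one label's rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(SCHEMAS[label])
    writer.writerows(rows)
    return out.getvalue()


sample = ["A,0,00,01,Alex", "B,2,h", "A,2,22,02,Paul", "C,99"]
groups = split_by_label(sample)
for label in sorted(groups):
    print(f"--- {label} ---")
    print(to_csv(label, groups[label]), end="")
```

Each label's output could then be written to its own file or directory, with one Hive table (or one Avro schema) defined per label, so no table needs padding columns.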