I have recently started experimenting with HBase and the Hadoop stack. I am building an application from scratch and designing its schema; the application will use the Google n-gram data set.
I realize the data set can be modeled in two ways: with the n-gram as the row key and a single column family holding several qualifiers (year, page_count, match_count), or with the n-gram as the row key and a separate column family for each of year, page_count, and match_count.
I realize the right model depends on how I want to use this data, but I would like to understand the advantages and disadvantages of both approaches.
Cheers, Dwarak
Consider reading this chapter from the HBase book: 6.2. On the number of column families
"HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed though the amount of data they carry is small. When many column families exist the flushing and compaction interaction can make for a bunch of needless i/o loading (To be addressed by changing flushing and compaction to work on a per column family basis)."
"Try to make do with one column family if you can in your schemas. Only introduce a second and third column family in the case where data access is usually column scoped; i.e. you query one column family or the other but usually not both at the one time"
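To make the flushing interaction from the quote concrete, here is a toy model in plain Python. It is not HBase code and does not use the HBase API: the `ToyRegion` class, the `FLUSH_THRESHOLD` value, and the write sizes are all made up for illustration. It only assumes what the quote states, namely that a flush is triggered at the region level and takes every column family with it.

```python
from collections import defaultdict

# Hypothetical region-wide flush threshold in bytes (not a real HBase default).
FLUSH_THRESHOLD = 1000


class ToyRegion:
    """Toy stand-in for a region with one in-memory buffer per column family."""

    def __init__(self, families):
        self.memstore = {family: 0 for family in families}
        self.flushed_files = defaultdict(list)  # family -> sizes of files written

    def put(self, family, size):
        self.memstore[family] += size
        # The trigger looks at the whole region, not at a single family.
        if sum(self.memstore.values()) >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        # Every family with buffered data is written out, however little it holds.
        for family, size in self.memstore.items():
            if size > 0:
                self.flushed_files[family].append(size)
            self.memstore[family] = 0


region = ToyRegion(["big", "small"])
for _ in range(100):
    region.put("big", 100)   # this family carries the bulk of the data
    region.put("small", 1)   # this family receives only a trickle

for family, sizes in region.flushed_files.items():
    print(family, "files:", len(sizes), "largest:", max(sizes))
```

In this model the small family ends up with as many files as the big one, each only a few bytes long, which is the needless I/O the book warns about.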
Now, keep in mind that physically, all members of a column family are stored together on the filesystem. Because tuning and storage settings are specified at the column family level, it is advised that all members of a column family have the same general access pattern and size characteristics. If all your data will be processed at the same time, consider a table with only one column family. It is better not to use multiple families unless they are accessed separately almost all the time.
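For the n-gram case, one possible single-family layout could look like the sketch below. This is only an illustration using nested Python dicts in place of a real table, not the HBase client API. The family name `d`, the `year:field` qualifier format, the helper names, and the sample counts are all assumptions made for the example.

```python
# Hypothetical name for the single column family.
FAMILY = "d"


def to_cells(ngram, year, match_count, page_count):
    """Turn one n-gram record into (row key, family, qualifier, value) cells.

    The year goes into the qualifier, so every year of an n-gram
    lives in the same row and the same column family.
    """
    return [
        (ngram, FAMILY, f"{year}:match_count", match_count),
        (ngram, FAMILY, f"{year}:page_count", page_count),
    ]


def put_all(table, cells):
    """Store cells in a nested dict: table[row][family][qualifier] = value."""
    for row, family, qualifier, value in cells:
        table.setdefault(row, {}).setdefault(family, {})[qualifier] = value


table = {}
# Made-up sample values, not real n-gram counts.
put_all(table, to_cells("hello world", 1999, 42, 17))
put_all(table, to_cells("hello world", 2000, 50, 20))

for qualifier, value in sorted(table["hello world"][FAMILY].items()):
    print(qualifier, value)
```

With this layout, reading one row returns every year and both counts for an n-gram, which fits the case where all the data is processed together. If the counts were almost always read separately, splitting them into two families would be the alternative to weigh.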