In the HBase book I noticed that there is an important concept called a "region".
For example:
Currently, flushing and compactions are done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small
Around 50-100 regions is a good number for a table with 1 or 2 column families. Remember that a region is a contiguous segment of a column family
It seems that one "region" belongs to one or more column families?
I am confused about what exactly a "region" is.
If you check how HBase data is stored on HDFS:
user@host:~$ hdfs dfs -du -h /hbase/data/default/traffic
284 /hbase/data/default/traffic/.tabledesc
0 /hbase/data/default/traffic/.tmp
382.8 M /hbase/data/default/traffic/08ec69a079692f404c8d2949066f569b
124.1 M /hbase/data/default/traffic/0d986ba711e8dee5458090f98cccd446
110.9 M /hbase/data/default/traffic/0ea846c84192e3a744a4de907895351e
271.0 M /hbase/data/default/traffic/0f1682446b5331bdebbdee64b5a20c4f
198.3 M /hbase/data/default/traffic/0f349f966564ae0e87e927cc079aec86
...
You will see many folders with hashed names; each of these folders contains the data of one region.
Inside each region you will see folders that are grouped by column families:
user@host:~$ hdfs dfs -du -h /hbase/data/default/traffic/f51ec9f3170e9abaf44537e96ebf8560
163 /hbase/data/default/traffic/f51ec9f3170e9abaf44537e96ebf8560/.regioninfo
243.8 M /hbase/data/default/traffic/f51ec9f3170e9abaf44537e96ebf8560/r
124.2 M /hbase/data/default/traffic/f51ec9f3170e9abaf44537e96ebf8560/z
In my case I have two column families, named r and z. Inside the column family folders you will find HFiles.
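The directory layout shown in the listings above can be summarized as `<root>/<namespace>/<table>/<region>/<column family>`. Below is a minimal Python sketch that splits such a path into its parts. The helper name `parse_hbase_path` and the `HBasePath` tuple are made up for illustration, and the default root `/hbase/data` is simply taken from the listings; your cluster may use a different root directory.

```python
from typing import NamedTuple, Optional


class HBasePath(NamedTuple):
    """Parts of a path under the HBase data directory (illustrative only)."""
    namespace: str
    table: str
    region: Optional[str]   # hashed region folder name, if present
    family: Optional[str]   # column family folder name, if present


def parse_hbase_path(path: str, root: str = "/hbase/data") -> HBasePath:
    """Split <root>/<namespace>/<table>[/<region>[/<family>]] into its parts.

    Entries starting with a dot (.tabledesc, .tmp, .regioninfo) are metadata,
    not regions or column families, so they are reported as None.
    """
    prefix = root.rstrip("/") + "/"
    if not path.startswith(prefix):
        raise ValueError(f"{path!r} is not under {root!r}")
    parts = [p for p in path[len(prefix):].split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"{path!r} does not contain a namespace and a table")

    namespace, table = parts[0], parts[1]
    region = None
    family = None
    if len(parts) > 2 and not parts[2].startswith("."):
        region = parts[2]
        if len(parts) > 3 and not parts[3].startswith("."):
            family = parts[3]
    return HBasePath(namespace, table, region, family)


if __name__ == "__main__":
    print(parse_hbase_path(
        "/hbase/data/default/traffic/f51ec9f3170e9abaf44537e96ebf8560/r"))
```

Applied to the paths from my listing, the last component is the column family (r or z), the one before it is the region, and `default` is the namespace of the `traffic` table.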
To answer your question: a region is a part of the table that covers a specific range of row keys. It contains all the column families of the table. If you edit the table schema and add a new column family, all regions of this table will be updated.
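To make the relationship concrete, here is a toy Python model of the idea. It is not HBase code: the `Table` and `Region` classes, the split keys, and the row keys are all invented for illustration. It only shows that a table is cut into regions by row key range, and that every region holds one store per column family.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Region:
    """A contiguous range of row keys (toy model, not the HBase class)."""
    start_key: str                      # inclusive; "" means start of table
    end_key: str                        # exclusive; "" means end of table
    # One store per column family: family -> {row_key: {qualifier: value}}
    stores: dict = field(default_factory=dict)


class Table:
    def __init__(self, families, split_keys):
        self.families = list(families)
        bounds = [""] + sorted(split_keys) + [""]
        self.regions = [
            Region(bounds[i], bounds[i + 1], {f: {} for f in self.families})
            for i in range(len(bounds) - 1)
        ]
        self._starts = [r.start_key for r in self.regions]

    def region_for(self, row_key):
        """Find the single region whose key range contains row_key."""
        idx = bisect.bisect_right(self._starts, row_key) - 1
        return self.regions[idx]

    def put(self, row_key, family, qualifier, value):
        """Write a cell; the row key alone decides which region gets it."""
        store = self.region_for(row_key).stores[family]
        store.setdefault(row_key, {})[qualifier] = value

    def add_family(self, family):
        """A schema change touches every region of the table."""
        self.families.append(family)
        for region in self.regions:
            region.stores[family] = {}


if __name__ == "__main__":
    table = Table(families=["r", "z"], split_keys=["g", "p"])
    table.put("host42", "r", "bytes", 1024)
    table.put("host42", "z", "zone", "eu")
    region = table.region_for("host42")
    print(region.start_key, region.end_key, sorted(region.stores))
```

Both cells of row `host42` end up in the same region, in different stores, because the region is chosen by row key only. This is also why the book says flushes and compactions are done per region: the column families in a region are handled together.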