Performance cost of HBase table with a high number of versions per row?


We are implementing an HBase storage mechanism with one table that uses a (String) row key and a (long) timestamp to maintain multiple versions of a single row. This is a core feature of HBase, and will be very useful for us.

In most cases, the rows will only have a dozen or so versions, and each version should only be a few KB in size across all cells. However, there is an edge case in which a row could have hundreds of versions, each with a different timestamp, and it's unclear whether there would be any performance or scaling cost to setting the max number of versions per row (just on this one table) to "1000" (one thousand).

In terms of access patterns, when we pull data out it will be one of:

  1. Pull out the "latest" version of the row, given a row key
  2. Pull out a specified version of the row, given a row key and timestamp
  3. Pull out a single cell (called "ts") containing a long from each version of the row, given a row key

The last pattern, 3), lets us discover which versions exist for each row without having to pull out all versions of the row. In the worst case, we would get back 1000 (one thousand) longs in a single HBase Get request. That would be 64 Kb (64,000 bits, or roughly 8 KB) of values. We will never need to request every cell of every version of a row in one Get request.
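As a rough check on that figure, the arithmetic can be sketched in Java. The sketch assumes each "ts" value is an 8-byte long and, for the per-cell estimate, the usual HBase KeyValue framing of about 20 fixed bytes per cell (key length, value length, row length, family length, timestamp, type) plus the row key, family and qualifier bytes. The 32-byte row key and the one-byte family name are hypothetical values picked only for illustration, and the class is not part of any HBase API.

```java
public class TsPayloadEstimate {

    /** Bytes taken by the long values alone, ignoring all cell framing. */
    static long valueBytes(int versions) {
        return (long) versions * Long.BYTES;
    }

    /**
     * Approximate size of one cell in the KeyValue layout:
     * 4 (key length) + 4 (value length) + 2 (row length) + row
     * + 1 (family length) + family + qualifier + 8 (timestamp) + 1 (type) + value.
     * Tags and RPC overhead are deliberately ignored.
     */
    static long cellBytes(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return 4 + 4 + 2 + rowLen + 1 + familyLen + qualifierLen + 8 + 1 + valueLen;
    }

    public static void main(String[] args) {
        int versions = 1000;       // worst case from the question
        int rowLen = 32;           // hypothetical row key length in bytes
        int familyLen = 1;         // hypothetical one-byte column family name
        int qualifierLen = 2;      // the qualifier "ts"

        long values = valueBytes(versions);
        long perCell = cellBytes(rowLen, familyLen, qualifierLen, Long.BYTES);

        System.out.println("value bytes only : " + values);
        System.out.println("value bits only  : " + values * 8);
        System.out.println("bytes per cell   : " + perCell);
        System.out.println("bytes, all cells : " + perCell * versions);
    }
}
```

Under these assumptions the values alone come to 8,000 bytes (64,000 bits), and the full cells to roughly 63 KB, because every returned cell repeats the row key, family and qualifier. Either way the response for pattern 3 stays small.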

There has been a suggestion within the team that this could cause performance issues; however, we can't find clarification either way in the HBase manual.

So, given the above, my question is: is there any kind of performance cost to us having a table with (potentially) 1000 versions per row?


Solution

  • A {row, column, version} tuple exactly specifies a cell in HBase. It’s possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.

    While rows and column keys are expressed as bytes, the version is specified using a long integer. ...

    As you can see, HBase is designed to allow a maximum number of versions as high as Integer.MAX_VALUE, but storing a number of versions anywhere near that limit carries a lot of risk.

    1. Number of Versions

    37.1. Maximum Number of Versions

    The maximum number of row versions to store is configured per column family via HColumnDescriptor. The default for max versions is 1. This is an important parameter because as described in Data Model section HBase does not overwrite row values, but rather stores different values per row by time (and qualifier). Excess versions are removed during major compactions. The number of max versions may need to be increased or decreased depending on application needs.

    It is not recommended setting the number of max versions to an exceedingly high level (e.g., hundreds or more) unless those old values are very dear to you because this will greatly increase StoreFile size.

    From the official documentation quoted above, we can draw some conclusions about your question.

    First, you are likely to run out of memory during compaction.

    Second, a region is never split within a single row key, so a row with many large versions stays in one region however big it grows.
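To make the versioning behaviour described above more concrete, here is a toy in-memory model in Java. It is only a sketch and does not use the HBase client API: the class and method names (`VersionedColumn`, `getLatest`, `getAt`, `listTimestamps`, `compact`) are invented for illustration. It keeps the versions of one column sorted newest-first, supports the three read patterns from the question, and drops versions beyond the configured maximum only in `compact()`, loosely mirroring the statement that excess versions are removed during major compactions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of one versioned column of one row; NOT the HBase client API. */
public class VersionedColumn {

    // Timestamps sorted in descending order, so the first entry is the newest version.
    private final NavigableMap<Long, byte[]> versions =
            new TreeMap<>(Comparator.reverseOrder());
    private final int maxVersions;

    public VersionedColumn(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    /** Writing never overwrites older versions; it adds a new (timestamp, value) pair. */
    public void put(long timestamp, byte[] value) {
        versions.put(timestamp, value);
    }

    /** Pattern 1: the latest version. */
    public byte[] getLatest() {
        Map.Entry<Long, byte[]> newest = versions.firstEntry();
        return newest == null ? null : newest.getValue();
    }

    /** Pattern 2: one specific version, or null if it does not exist. */
    public byte[] getAt(long timestamp) {
        return versions.get(timestamp);
    }

    /** Pattern 3: which versions exist, newest first, without reading the values. */
    public List<Long> listTimestamps() {
        return new ArrayList<>(versions.keySet());
    }

    /** Removes the oldest versions beyond maxVersions; returns how many were dropped. */
    public int compact() {
        int removed = 0;
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // last entry = oldest timestamp
            removed++;
        }
        return removed;
    }

    public int storedVersions() {
        return versions.size();
    }

    public static void main(String[] args) {
        VersionedColumn column = new VersionedColumn(1000);
        for (long ts = 1; ts <= 1005; ts++) {
            column.put(ts, new byte[] {(byte) (ts % 128)});
        }
        System.out.println("stored before compaction: " + column.storedVersions());
        System.out.println("dropped by compaction:    " + column.compact());
        System.out.println("stored after compaction:  " + column.storedVersions());
        System.out.println("newest timestamp:         " + column.listTimestamps().get(0));
    }
}
```

The point of the model is that every version is a separate stored entry until a compaction trims it, so a row allowed 1000 versions of a few KB each can hold several MB that must all live in the same region.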