scala, apache-spark, hbase, hadoop2

Scala: Creating an HBase table with pre-split regions based on row key


I have three RegionServers. I want to evenly distribute an HBase table across the three RegionServers based on row keys which I have already identified (say, rowkey_100 and rowkey_200). It can be done from the hbase shell using:

create 'tableName', 'columnFamily', {SPLITS => ['rowkey_100','rowkey_200']} 

If I am not mistaken, these two split points will create three regions: keys before rowkey_100 go to the first region, keys from rowkey_100 up to rowkey_200 go to the second, and the remaining keys go to the last, so each RegionServer hosts one region. I want to do the same thing using Scala code. How can I specify these split points when creating the table in Scala?


Solution

  • Below is a Scala snippet for creating an HBase table with split points (using the classic HBaseAdmin / HTableDescriptor API):

    import org.apache.hadoop.hbase.{HColumnDescriptor, HTableDescriptor}
    import org.apache.hadoop.hbase.client.HBaseAdmin

    val admin = new HBaseAdmin(conf)

    if (!admin.tableExists(myTable)) {
      val htd = new HTableDescriptor(myTable)
      val hcd = new HColumnDescriptor(myCF)
      // Two split points produce three regions.
      val splits = Array[Array[Byte]](splitPoint1.getBytes, splitPoint2.getBytes)

      htd.addFamily(hcd)
      admin.createTable(htd, splits)
    }
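
    The snippet assumes conf, myTable, myCF, splitPoint1, and splitPoint2 are already in scope; a minimal sketch of those definitions (the literal values are just the ones from the question) could be:

    import org.apache.hadoop.hbase.HBaseConfiguration

    val conf = HBaseConfiguration.create() // picks up hbase-site.xml from the classpath
    val myTable = "tableName"
    val myCF = "columnFamily"
    val splitPoint1 = "rowkey_100"
    val splitPoint2 = "rowkey_200"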
    

    There are some predefined region split policies, but if you want your own way of computing split points that span your rowkey range, you can write a simple function like the following:

    def autoSplits(n: Int, range: Int = 256): Array[Array[Byte]] = {
      val splitPoints = new Array[Array[Byte]](n)
      // Place n single-byte split points at even intervals across the key range.
      for (i <- 0 until n) {
        splitPoints(i) = Array[Byte](((range / (n + 1)) * (i + 1)).toByte)
      }
      splitPoints
    }
    

    Just comment out the val splits = ... line and replace createTable's splits parameter with autoSplits(2) or autoSplits(4, 128), etc.
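
    With the default range, autoSplits(2) produces single-byte split points at 0x55 and 0xAA (256 / 3 = 85, and 85 * 2 = 170), which again yields three evenly sized regions. A sketch of the call, reusing admin and htd from the first snippet:

    admin.createTable(htd, autoSplits(2))

    For the uniform case there is also a createTable overload on HBaseAdmin that derives evenly spaced split points itself from a start key, an end key, and a region count; a minimal sketch (the keys are just the ones from the question):

    import org.apache.hadoop.hbase.util.Bytes

    // First region ends at rowkey_100, last region starts at rowkey_200,
    // matching the three-region layout described in the question.
    admin.createTable(htd, Bytes.toBytes("rowkey_100"), Bytes.toBytes("rowkey_200"), 3)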