Tags: hadoop, hbase, hbasestorage

HBase bulk load usage


I am trying to import some HDFS data into an already existing HBase table. The table was created with 2 column families and all the default settings HBase uses when creating a new table. It is already filled with a large volume of data and has 98 online regions. Its row keys have the (simplified) form: 2-CHAR-ID + 6-DIGIT-NUMBER + 3 x 32-CHAR-MD5-HASH.

Example of key: IP281113ec46d86301568200d510f47095d6c99db18630b0a23ea873988b0fb12597e05cc6b30c479dfb9e9d627ccfc4c5dd5fef.

The data I want to import is on HDFS, and I am using a Map-Reduce job to read it. My mapper emits one Put object per line read from the HDFS files. The data being imported has keys that will all start with "XX181113". The job is configured with:

HFileOutputFormat.configureIncrementalLoad(job, hTable)
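For context, a typical bulk-load job is wired up roughly as below. This is a hedged sketch: the mapper class, table name, and paths are placeholders, not taken from the original question. The key point is that configureIncrementalLoad reads the table's current region boundaries and installs a TotalOrderPartitioner over them.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "hdfs-to-hbase-bulk-load");

// MyBulkLoadMapper is hypothetical: it parses each input line and emits
// (ImmutableBytesWritable rowKey, Put) pairs.
job.setJarByClass(MyBulkLoadMapper.class);
job.setMapperClass(MyBulkLoadMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);

FileInputFormat.addInputPath(job, new Path("/input/data"));
FileOutputFormat.setOutputPath(job, new Path("/output/hfiles"));

// Sets reducer count to the table's region count and configures a
// TotalOrderPartitioner over the table's current region start keys.
HTable hTable = new HTable(conf, "my_table");
HFileOutputFormat.configureIncrementalLoad(job, hTable);

job.waitForCompletion(true);
```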

Once I start the job, I see it configured with 98 reducers (equal to the number of online regions the table has), but the issue is that 4 reducers got 100% of the data split among them, while the rest did nothing. As a result, I see only 4 output folders, which are very large. Do these files correspond to 4 new regions which I can then import into the table? And if so, why only 4, when 98 reducers get created? Reading the HBase docs

In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region. In order to do this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.

confused me even more as to why I get this behaviour.
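The partitioning behaviour the docs describe can be illustrated with a small self-contained sketch. The split points below are invented for illustration; a total-order partitioner simply finds which key range (region) each row key falls into, so keys sharing one prefix all land in the same few ranges:

```java
import java.util.Arrays;

public class PrefixPartitionDemo {
    // Hypothetical region split points for an existing table whose keys
    // span many prefixes (5 split points => 6 regions / partitions 0..5).
    static final String[] SPLITS = {"AA", "FF", "KK", "PP", "UU"};

    // Mimics what a total-order partitioner does: locate the key range
    // (region) a row key belongs to via binary search over split points.
    static int partition(String key) {
        int idx = Arrays.binarySearch(SPLITS, key);
        // binarySearch returns -(insertionPoint) - 1 for non-exact matches
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        // New rows that all share one prefix land in the same range:
        String[] newKeys = {"XX181113aaa", "XX181113bbb", "XX181113ccc"};
        for (String k : newKeys) {
            // every key maps to partition 5, past the last split point,
            // so a single reducer would receive all of this data
            System.out.println(k + " -> partition " + partition(k));
        }
    }
}
```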

Thanks!


Solution

  • How the data is distributed among the reducers doesn't depend on the number of regions in the table, but rather on how the key space is split into regions (each region holds a range of keys). Since you mention that all your new data starts with the same prefix, it likely fits into only a few regions. You can pre-split your table so that the new data is divided among more regions.
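One way to pre-split is sketched below, using the older HBaseAdmin API. This is a hedged, illustrative example: the table name, family names, and split points are assumptions. Since the MD5-hash portion of the key is roughly uniform, splitting inside the shared prefix on the first hex digit of the hash spreads the load:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);

// Hypothetical pre-split table mirroring the original's 2 column families.
HTableDescriptor desc = new HTableDescriptor("my_table_presplit");
desc.addFamily(new HColumnDescriptor("cf1"));
desc.addFamily(new HColumnDescriptor("cf2"));

// Split points inside the shared "XX181113" prefix, one per hex digit of
// the MD5 hash that follows it, yielding 16 regions for this prefix.
String hex = "123456789abcdef";
byte[][] splits = new byte[hex.length()][];
for (int i = 0; i < hex.length(); i++) {
    splits[i] = Bytes.toBytes("XX181113" + hex.charAt(i));
}
admin.createTable(desc, splits);
```

With the table pre-split this way, configureIncrementalLoad would create one reducer per region, and the bulk-load output would be spread across them instead of piling onto a handful.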