Tags: hadoop, hbase, bulk-load, hfile

Bulk loading with LoadIncrementalHFiles and subdirectories


I wrote a Spark application that generates HFiles to be used for bulk loading with the LoadIncrementalHFiles command later. As the source data pool is very big, the input files are split into iterations that are processed one after the other. Each iteration creates its own HFile directory, so my HDFS structure looks like this:

/user/myuser/map_data/hfiles_0
         ...         /hfiles_1
         ...         /hfiles_2
         ...         /hfiles_3
                     ...

There are about 500 of these subdirectories under map_data, therefore I'm searching for a way to call the LoadIncrementalHFiles command automatically for each of them, processing the subdirectories in iterations as well.

The corresponding command would be this:

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /user/myuser/map_data/hfiles_0 mytable

I need to change this into an iterative call, as the command does not work with subdirectories (i.e. when I call it with the /user/myuser/map_data parent directory)!

I tried to use a Java Process instance to execute the command above automatically, but this doesn't seem to do anything (no output to the console and also no new rows in my HBase table).

Using the org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles Java class directly from my code doesn't work either; it is also not responding!
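
For reference, the usual way to call the class programmatically looks roughly like this (a minimal sketch only; the table name and HFile directory are placeholders, and it assumes the table already exists):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadSingleDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName tableName = TableName.valueOf("mytable");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin();
             Table table = connection.getTable(tableName);
             RegionLocator locator = connection.getRegionLocator(tableName)) {

            // Bulk-load the HFiles of one subdirectory into the existing table
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(new Path("/user/myuser/map_data/hfiles_0"), admin, table, locator);
        }
    }
}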

Does anybody have a working example for me? Or is there a parameter that makes the above hbase command work on the parent directory? I'm working with HBase 1.1.2 on a Hortonworks Data Platform 2.5 cluster.

EDIT: I tried to run the LoadIncrementalHFiles command from a Hadoop client Java application, but I'm getting an exception related to Snappy compression, see Run LoadIncrementalHFiles from Java client.


Solution

  • The solution was to split the hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /user/myuser/map_data/hfiles_0 mytable command into its individual parts (one array element per command part) and to run it once per subdirectory, see this Java code snippet (a sketch of the getHFileDirectories helper used here follows below):

    TreeSet<String> subDirs = getHFileDirectories(new Path(HDFS_PATH), hadoopConf);

    for (String hFileDir : subDirs) {
        try {
            String pathToReadFrom = HDFS_OUTPUT_PATH + "/" + hFileDir;

            // The command is split into one array element per part
            String[] execCode = {"hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles", "-Dcreate.table=no", pathToReadFrom, hbaseTableName};
            ProcessBuilder pb = new ProcessBuilder(execCode);
            pb.redirectErrorStream(true);
            final Process p = pb.start();

            // Write the output of the process to the console
            new Thread(new Runnable() {
                public void run() {
                    BufferedReader input = new BufferedReader(new InputStreamReader(p.getInputStream()));
                    String line = null;

                    try {
                        while ((line = input.readLine()) != null)
                            System.out.println(line);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }).start();

            // Wait for the end of the execution
            p.waitFor();
            ...
    }
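
    The getHFileDirectories helper is not part of the HBase API; it simply collects the names of the hfiles_* subdirectories under the given path via the HDFS FileSystem API. A minimal sketch of such a helper could look like this:

    import java.io.IOException;
    import java.util.TreeSet;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    private static TreeSet<String> getHFileDirectories(Path path, Configuration conf) throws IOException {
        TreeSet<String> dirs = new TreeSet<String>();
        FileSystem fs = FileSystem.get(conf);

        // Collect the name of every subdirectory (hfiles_0, hfiles_1, ...) below the given path
        for (FileStatus status : fs.listStatus(path)) {
            if (status.isDirectory()) {
                dirs.add(status.getPath().getName());
            }
        }
        return dirs;
    }

    The isDirectory() check skips any plain files (e.g. job marker files) that may also sit under map_data, so only real HFile directories are passed to the bulk-load command.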