hadoop · hbase · hdfs

Why is an exported HBase table 4 times bigger than its original?


I need to back up an HBase table before updating to a newer version. I decided to export the table to HDFS with the standard Export tool and then move it to the local file system. For some reason the exported table is 4 times larger than the original:

hdfs dfs -du -h
1.4T    backup-my-table

hdfs dfs -du -h /hbase/data/default/
417G    my-table

What can be the reason? Is it somehow related to compression?

P.S. Maybe the way I made the backup matters. First I took a snapshot of the target table, then cloned it to a copy table, then deleted the unnecessary column families from the copied table (so I expected the result to be about half the size), then I ran the Export tool on that copy.
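For reference, this is roughly what that procedure looks like; the table, snapshot, and column family names below are placeholders, not the ones actually used:

# in hbase shell -- names are illustrative only
snapshot 'my-table', 'my-table-snapshot'
clone_snapshot 'my-table-snapshot', 'my-table-copy'
# drop the column family that should not go into the backup
alter 'my-table-copy', NAME => 'unneeded_cf', METHOD => 'delete'

# then, from the command line, export the trimmed copy
./hbase org.apache.hadoop.hbase.mapreduce.Export my-table-copy backup-my-table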


Update for future visitors: here's the correct command to export a table with compression:

./hbase org.apache.hadoop.hbase.mapreduce.Export \
 -Dmapreduce.output.fileoutputformat.compress=true \
 -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
 -Dmapreduce.output.fileoutputformat.compress.type=BLOCK \
 -Dhbase.client.scanner.caching=200 \
  table-to-export export-dir
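To confirm the codec actually took effect, you can compare sizes again and peek at the header of one of the output files: Export writes SequenceFiles, and a SequenceFile records its codec class near the start of the file. The part-file name below is only an example:

hdfs dfs -du -h export-dir
# the header should mention org.apache.hadoop.io.compress.GzipCodec
hdfs dfs -cat export-dir/part-m-00000 | head -c 300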

Solution

  • Maybe the original table is compressed with SNAPPY or some other codec. In that case the HFiles on disk are stored compressed, while Export writes plain, uncompressed SequenceFiles by default, so the exported copy can easily be several times larger. A compressed table is created like this:

    create 't1', { NAME => 'cf1', COMPRESSION => 'SNAPPY' }
    
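    To check whether your existing table already has a codec set, look at its column family descriptors in the HBase shell (the table name here is a placeholder); a COMPRESSION => 'SNAPPY' attribute in the output confirms it:

    describe 'my-table'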

    Compression support check

    Use CompressionTest to verify that Snappy support is enabled and the libraries can be loaded on all nodes of your cluster:

    $ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy
    

    Applying compression through the Export command:

    If you dig into the Export command source, you will find the properties below, which can reduce the output size drastically:

    mapreduce.output.fileoutputformat.compress=true
    mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
    mapreduce.output.fileoutputformat.compress.type=BLOCK

    /*
       * @param errorMsg Error message.  Can be null.
       */
      private static void usage(final String errorMsg) {
        if (errorMsg != null && errorMsg.length() > 0) {
          System.err.println("ERROR: " + errorMsg);
        }
        System.err.println("Usage: Export [-D <property=value>]* <tablename> <outputdir> [<versions> " +
          "[<starttime> [<endtime>]] [^[regex pattern] or [Prefix] to filter]]\n");
        System.err.println("  Note: -D properties will be applied to the conf used. ");
        System.err.println("  For example: ");
        System.err.println("   -D mapreduce.output.fileoutputformat.compress=true");
        System.err.println("   -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec");
        System.err.println("   -D mapreduce.output.fileoutputformat.compress.type=BLOCK");
        System.err.println("  Additionally, the following SCAN properties can be specified");
        System.err.println("  to control/limit what is exported..");
        System.err.println("   -D " + TableInputFormat.SCAN_COLUMN_FAMILY + "=<familyName>");
        System.err.println("   -D " + RAW_SCAN + "=true");
        System.err.println("   -D " + TableInputFormat.SCAN_ROW_START + "=<ROWSTART>");
        System.err.println("   -D " + TableInputFormat.SCAN_ROW_STOP + "=<ROWSTOP>");
        System.err.println("   -D " + JOB_NAME_CONF_KEY
            + "=jobName - use the specified mapreduce job name for the export");
        System.err.println("For performance consider the following properties:\n"
            + "   -Dhbase.client.scanner.caching=100\n"
            + "   -Dmapreduce.map.speculative=false\n"
            + "   -Dmapreduce.reduce.speculative=false");
        System.err.println("For tables with very wide rows consider setting the batch size as below:\n"
            + "   -D" + EXPORT_BATCHING + "=10");
      }
    
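    Since you only needed some of the column families, you could also skip the clone-and-delete step and restrict the export itself with the scan property listed above; a rough sketch, assuming the family you want to keep is called needed_cf:

    ./hbase org.apache.hadoop.hbase.mapreduce.Export \
     -Dmapreduce.output.fileoutputformat.compress=true \
     -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
     -Dhbase.mapreduce.scan.column.family=needed_cf \
      table-to-export export-dir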

    Also see getExportFilter, which might be useful in your case to narrow the export.

      private static Filter getExportFilter(String[] args) {
        Filter exportFilter = null;
        String filterCriteria = (args.length > 5) ? args[5] : null;
        if (filterCriteria == null) return null;
        if (filterCriteria.startsWith("^")) {
          String regexPattern = filterCriteria.substring(1, filterCriteria.length());
          exportFilter = new RowFilter(CompareOp.EQUAL, new RegexStringComparator(regexPattern));
        } else {
          exportFilter = new PrefixFilter(Bytes.toBytesBinary(filterCriteria));
        }
        return exportFilter;
      }
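    Because the filter is read from the sixth positional argument (after versions, start time, and end time), a row-prefix filter can be passed roughly like this; the version count, timestamps, and prefix below are placeholders, and starting the last argument with ^ switches it to a regex row filter instead:

    ./hbase org.apache.hadoop.hbase.mapreduce.Export \
      table-to-export export-dir 1 0 1600000000000 row-prefix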