Tags: encoding, utf-8, hbase, string-decoding

Putting German text in an HBase table


I am trying to update a table by adding a German string with the following command:

    put 'table:data_validation_test', '58e1f4200f23e474ca2d7f3a', 'urlbody:data', 'Auslöser'

Scanning the table then shows this:

scan 'table:data_validation_test'
ROW                                  COLUMN+CELL                                                                                               
 58e1f4200f23e474ca2d7f3a            column=urlbody:data, timestamp=1491215905923, value=Ausl\xC3\xB6ser                                       
 58e1f4200f23e474ca2d7f3a            column=urlbody:id, timestamp=1491215697534, value=58e1f4200f23e474ca2d7f3a

I can't find a way to set the string encoding in HBase. How can I store the string in HBase exactly as it is?


Solution

  • This is only a display issue in the output of the scan command (get behaves the same way). Your string is stored correctly.

    In UTF-8, ö is encoded as the two bytes \xC3 and \xB6. Neither is a printable ASCII character, so the shell prints them as hex escapes. Remember that HBase stores every value as a byte array (Array[Byte]) and does not track a character encoding.

    You can check this by reading the value with JRuby inside the HBase shell and decoding it as UTF-8:

    include Java
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.hadoop.hbase.client.Get
    import org.apache.hadoop.hbase.util.Bytes
    
    # Open the table and fetch the row by its key.
    conf = HBaseConfiguration.create
    htable = HTable.new(conf, 'table:data_validation_test')
    result = htable.get(Get.new('58e1f4200f23e474ca2d7f3a'.to_java_bytes))
    
    # Bytes.toString decodes the stored bytes as UTF-8 before printing.
    puts Bytes.toString(result.getValue('urlbody'.to_java_bytes, 'data'.to_java_bytes))
    

    The value should then be displayed correctly as Auslöser.
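
To see why the stored bytes and the escaped output are the same data, here is a minimal plain-Ruby sketch that needs no HBase connection. The helper `escape_bytes` is hypothetical and only approximates the way the shell renders bytes (printable ASCII as-is, everything else as `\xNN`); it is not the shell's actual code.

```ruby
# encoding: utf-8
# Plain Ruby sketch: why the shell shows "Ausl\xC3\xB6ser" for "Auslöser".

text = "Auslöser"

# UTF-8 encodes "ö" (U+00F6) as the two bytes 0xC3 0xB6, so 8 characters become 9 bytes.
utf8_bytes = text.encode("UTF-8").bytes

# Hypothetical helper approximating the shell's display: printable ASCII bytes
# (except the backslash) are shown as characters, every other byte as \xNN.
def escape_bytes(bytes)
  bytes.map do |b|
    (32..126).cover?(b) && b != 92 ? b.chr : format("\\x%02X", b)
  end.join
end

escaped = escape_bytes(utf8_bytes)

# Interpreting the same bytes as UTF-8 recovers the original string.
decoded = utf8_bytes.pack("C*").force_encoding("UTF-8")

puts escaped   # Ausl\xC3\xB6ser
puts decoded   # Auslöser
```

The escaped form and the decoded form come from the same nine bytes, so nothing needs to be changed on the write side. Only the reader has to decode the bytes as UTF-8.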