Tags: c#, utf-8, hbase

UTF-8 is not working for converting byte[] to string


I have a qualifier (of type long) in each row of an HBase table.

I want to fetch the HBase rows whose value lies between two long numbers. For that I am using the following filters.

My filters look like this:

long startEpochInDays = 384;

long endEpochInDays = 396;

string startDayFilter = "SingleColumnValueFilter('" + cf + "','" + qualifier + "', >= ,'binary:" + Encoding.UTF8.GetString(HBaseGenericHelper.GetBigEndianByteArray(startEpochInDays)) + "',true,true)";

string endDayFilter = "SingleColumnValueFilter('" + cf + "','" + qualifier + "', < ,'binary:" + Encoding.UTF8.GetString(HBaseGenericHelper.GetBigEndianByteArray(endEpochInDays)) + "',true,true)";

string finalFilter = startDayFilter + " AND " + endDayFilter;

These filters work fine for values up to 383, but fail for any larger value.

While debugging, I found that converting the long to a byte array returns bytes like \0\0\0\0\0\0\1\128.

When the last byte in the array is 127 or less, UTF-8 works fine, but once that byte is 128 or greater, UTF-8 starts returning "?" for it.

If I use the following approach to convert the byte array to a string:

Encoding encoding = new UTF8Encoding(true,true);
string number = encoding.GetString(HBaseGenericHelper.GetBigEndianByteArray(startEpochInDays));

UTF-8 throws an exception when converting the byte array to a string for the filter (if the last byte in the array is 128 or more).

Exception - Unable to translate bytes [8B] at index 6 from specified code page to Unicode.

Inner Exception -

at System.Text.DecoderExceptionFallbackBuffer.Throw(Byte[] bytesUnknown, Int32 index)
at System.Text.DecoderExceptionFallbackBuffer.Fallback(Byte[] bytesUnknown, Int32 index)
at System.Text.DecoderFallbackBuffer.InternalFallback(Byte[] bytes, Byte* pBytes)
at System.Text.UTF8Encoding.GetCharCount(Byte* bytes, Int32 count, DecoderNLS baseDecoder)
at System.String.CreateStringFromEncoding(Byte* bytes, Int32 byteLength, Encoding encoding)
at System.Text.UTF8Encoding.GetString(Byte[] bytes, Int32 index, Int32 count)
at System.Text.Encoding.GetString(Byte[] bytes)

Thanks in advance.


Solution

  • UTF-8 is not an appropriate way of encoding arbitrary bytes as a string. It does the opposite: it encodes arbitrary strings as bytes (and decodes them back, as long as the bytes are in the correct format). There is no reason to think that HBaseGenericHelper.GetBigEndianByteArray(startEpochInDays) returns UTF-8 data, so encoding.GetString is inappropriate here and is actually using the Encoding backwards. This is the first topic I discussed here - so don't panic: you're in good company - people make this mistake all the time.

    What you should be using is something like base-16 (hexadecimal) or base-64.

    To get hex: BitConverter.ToString(byte[]), which returns dash-separated pairs such as 01-80. To get base-64: Convert.ToBase64String(byte[]).

    If you need the data to be in a particular format that isn't base-64 or base-16, then you'll have to be specific about what format you want. But: it isn't "UTF-8 used backwards".
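    To make the point above concrete, here is a minimal sketch. The source of HBaseGenericHelper.GetBigEndianByteArray is not shown in the question, so ToBigEndianBytes below is a hypothetical stand-in that assumes it returns the 8 bytes of the long, most significant byte first. The class and method names are illustrative only, and nothing here says which text format the HBase filter itself expects.

```csharp
using System;
using System.Linq;
using System.Text;

public static class ByteTextDemo
{
    // Hypothetical stand-in for HBaseGenericHelper.GetBigEndianByteArray
    // (assumption: 8 bytes, most significant byte first).
    public static byte[] ToBigEndianBytes(long value)
    {
        byte[] bytes = BitConverter.GetBytes(value);
        if (BitConverter.IsLittleEndian)
        {
            Array.Reverse(bytes); // machine order -> big-endian
        }
        return bytes;
    }

    // True only if decoding the bytes as UTF-8 and re-encoding the
    // resulting string gives back exactly the same bytes.
    public static bool SurvivesUtf8RoundTrip(byte[] bytes)
    {
        string text = Encoding.UTF8.GetString(bytes);
        return Encoding.UTF8.GetBytes(text).SequenceEqual(bytes);
    }

    // Hex text: reversible for any byte values.
    public static string ToHex(byte[] bytes) => BitConverter.ToString(bytes);

    // Base-64 text: also reversible for any byte values.
    public static string ToBase64(byte[] bytes) => Convert.ToBase64String(bytes);
}
```

    With this sketch, ByteTextDemo.SurvivesUtf8RoundTrip(ByteTextDemo.ToBigEndianBytes(383)) is true, because every byte is 127 or less, while the same call for 384 is false: the trailing 0x80 byte is not valid UTF-8 on its own, so the default decoder substitutes a replacement character and the original byte is lost. ByteTextDemo.ToHex(ByteTextDemo.ToBigEndianBytes(384)) returns "00-00-00-00-00-00-01-80", and Convert.FromBase64String recovers the original bytes from the base-64 form.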