indexing, solr, lucene, hbase, nosql

How to index HBase columns with binary data as SOLR fields?


I need to index data stored in HBase rows. The obvious solution is to use the Lily HBase Indexer, which works through replication and pushes the results into a SOLR collection.

The root of my problem is that some columns in my HBase rows hold short binary values such as MD5, CRC64, UUID and the like. Of course I store them as raw byte[], which saves me a lot of space. But I need to index on some of these columns while keeping their actual value. What is the correct way to do this?

  • Currently the only SOLR field type I see that fits is BinaryField. But it requires the HBase column content to be Base64 encoded, and the Lily HBase Indexer doesn't appear to support that.
  • The only option I see in the Lily HBase Indexer itself is to configure the column mapping as bigDecimal. Is that applicable in this case? As I understand it, string is not an option.
  • If I use a Morphline, I can rely on the extractHBaseCells command from Cloudera with type byte[], which is promised to be a transparent pass-through. But what should I do with the extracted column data to end up with a SOLR binary field?
  • What about preserving lexicographical order for such binary fields in the index after mapping? I'd consider mapping byte[] to a sequence of 2-digit hex numbers, but is there a good way to do that?

Solution

  • I arrived at a working solution:

    • The Lily HBase Indexer is configured with the row mapping type, so the document ID (unique key) is the HBase row key.
    • The binary HBase row key is formatted according to the Lily HBase Indexer configuration, where the unique key formatter is set to 'com.ngdata.hbaseindexer.uniquekey.HexUniqueKeyFormatter'. As a result, the document ID ('id') SOLR field is a string of lowercase hex digits matching the binary row key. This could probably be better, but at least it works as expected. Note that the 'id' SOLR field is of type string here.
    • Binary cells are transformed by a Morphline using the extractHBaseCells command from Cloudera Search. The mapping uses type byte[], which turned out to produce exactly Base64-encoded fields.
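
To make the Base64 step concrete, here is a minimal Python sketch of what a raw cell value looks like once it is Base64 encoded. It is only an illustration and not part of the indexer: the function name `to_binary_field_value` and the sample bytes are made up, and it assumes the standard Base64 alphabet with padding.

```python
import base64


def to_binary_field_value(cell: bytes) -> str:
    """Return the Base64 text form of a raw cell value (hypothetical helper)."""
    return base64.b64encode(cell).decode("ascii")


# Stand-in for a 16-byte digest such as MD5; not real data.
cell = bytes(range(16))

encoded = to_binary_field_value(cell)
print(encoded)

# Lengths of the raw value, its Base64 form and its hex form.
print(len(cell), len(encoded), len(cell.hex()))

# Decoding gives back the original bytes.
assert base64.b64decode(encoded) == cell
```

A 16-byte value becomes 24 Base64 characters, compared with 32 characters for a hex string.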

    UPDATE 1:

    • I added the HBASE_INDEXER_CLASSPATH environment setting for the HBase Indexer, together with an additional class extending com.ngdata.hbaseindexer.uniquekey.BaseUniqueKeyFormatter. This class Base64-encodes the unique key so that it can be declared as BinaryField. This finally did everything I demand from the indexer: SOLR now receives correct 'update' requests with a Base64-encoded 'id' field plus the fields mapped from the other columns I need.
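
One thing worth knowing about a Base64-encoded key is that it round-trips cleanly but does not keep the byte order of the row keys. The Python sketch below shows this. The helpers `encode_key` and `decode_key` are hypothetical names used only for this illustration (they are not the formatter class itself), and the standard Base64 alphabet is assumed.

```python
import base64


def encode_key(row_key: bytes) -> str:
    """Base64-encode a binary row key (illustrative helper)."""
    return base64.b64encode(row_key).decode("ascii")


def decode_key(doc_id: str) -> bytes:
    """Recover the binary row key from its Base64 form."""
    return base64.b64decode(doc_id)


low, high = b"\x00", b"\xff"

# The encoding is reversible.
assert decode_key(encode_key(high)) == high

# As raw bytes, low sorts before high ...
print(low < high)
# ... but as Base64 strings the order flips, because '/' sorts before 'A'.
print(encode_key(low), encode_key(high))
print(encode_key(low) < encode_key(high))
```

So if range queries or sorting on the key matter, a Base64 'id' will not follow the HBase row key order.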

    UPDATE 2:

    • After playing with solr.BinaryField long enough, I settled on plain solr.StrField for everything that I need to index AS IS. Binary byte strings such as hashes are transformed into a sequence of lowercase hex digits, 2 digits per byte. This may not be the best in terms of performance, but it looks the most portable and flexible. For 'just stored' fields I already have a Base64 encoder, but I don't keep fields in SOLR if I don't index them.
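
The hex mapping can be sketched in a few lines of Python. This is an illustration only, with a made-up helper name `to_hex_field_value`, not the code running in the indexer. It also checks the ordering question from above: with exactly 2 lowercase hex digits per byte, sorting the strings gives the same order as sorting the raw bytes as unsigned values.

```python
import random


def to_hex_field_value(raw: bytes) -> str:
    """Map raw bytes to lowercase hex, 2 digits per byte (illustrative helper)."""
    return raw.hex()


print(to_hex_field_value(b"\x00\xff\x0a"))

# Random byte strings of length 1 to 4, generated with a fixed seed.
rng = random.Random(0)
samples = [
    bytes(rng.randrange(256) for _ in range(rng.randrange(1, 5)))
    for _ in range(200)
]

# Sorting by raw bytes and sorting by hex string yield the same sequence.
by_bytes = sorted(samples)
by_hex = sorted(samples, key=to_hex_field_value)
assert by_bytes == by_hex
```

The order is preserved because every byte maps to a fixed-width pair of digits and '0' to '9' sort before 'a' to 'f' in ASCII.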