hadoop solr hbase apache-phoenix

hbase-indexer: Solr numFound differs from the HBase table row count


My team recently started using hbase-indexer on CDH to index HBase table columns into Solr. After deploying the hbase-indexer server (called Key-Value Store Indexer in CDH) and starting our tests, we found that the row count of the HBase table differs from the number of documents in the Solr index.

We used Phoenix to count the HBase table rows:

0: jdbc:phoenix:slave1,slave2,slave3:2181> SELECT /*+ NO_INDEX */  COUNT(1) FROM C_PICRECORD;

+------------------------------------------+
|                 COUNT(1)                 |
+------------------------------------------+
| 4084355                                  |
+------------------------------------------+

And we used the Solr Web UI to count the documents in the Solr index:

numFound : 4060479

We could not find any errors in the hbase-indexer log or the Solr log, but the row count of the HBase table and the document count of the Solr index really are different. Has anyone run into this situation? I don't know what to do next.


Solution

  • My understanding:

    HBase row count - Solr document count (numFound) = missing records

    4084355 - 4060479 = 23876 (records that exist in HBase but are missing from Solr)
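
To see which records are missing rather than only how many, the two key sets can be compared directly. The following is a minimal sketch, assuming the HBase row keys and the Solr document ids have already been exported to two iterables of strings, and that the Solr unique key equals the HBase row key (this depends on your indexer configuration). The function name `missing_in_solr` is hypothetical.

```python
from typing import Iterable, List


def missing_in_solr(hbase_keys: Iterable[str], solr_ids: Iterable[str]) -> List[str]:
    """Return the HBase row keys that have no matching Solr document id.

    Assumption: the Solr unique key field holds the HBase row key.
    The Solr ids are held in memory as a set, which is fine for a few
    million short keys but not for arbitrarily large indexes.
    """
    indexed = set(solr_ids)
    return [key for key in hbase_keys if key not in indexed]


# Tiny illustration with made-up keys.
hbase_keys = ["row1", "row2", "row3", "row4"]
solr_ids = ["row1", "row3"]
missing = missing_in_solr(hbase_keys, solr_ids)
print(len(missing), missing)
```

Looking at the timestamps of the missing rows can help tell the causes below apart: a contiguous time window suggests the indexer was down for a period, while scattered rows suggest some writers skipped the WAL.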

    The Key-Value Store Indexer service uses the Lily HBase NRT Indexer to index the stream of records being added to HBase tables.

    The NRT (near-real-time) indexer works on incremental data, that is, on changes as they are written; it does not process the whole data set.

    From my experience, these are possible reasons:

    1) NRT indexing worked initially and then suddenly stopped working (for example, due to a health issue), which can leave a discrepancy in the counts.

    2) NRT indexing works off the WAL (write-ahead log). If the WAL is switched off while records are inserted into HBase (which is sometimes done for performance reasons), the NRT indexer won't work for those records.

    Possible solutions:

    1) Delete the Solr documents and load the data into Solr from HBase from scratch. You can run the HBase batch indexer over the whole data set (the batch indexer doesn't work on incremental data; it works on the whole data set).

    2) As part of the data-flow pipeline, write a MapReduce program that inserts the data into Solr (this is what we did in one of our implementations).
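
The second option boils down to reading rows and sending them to Solr in batches. The sketch below shows that idea in a single Python process; it is not a MapReduce job and not the implementation mentioned above. It assumes rows arrive as `(row_key, columns)` pairs, that the Solr unique key field is named `id`, and that the collection accepts JSON documents on its `/update` handler. All function names and the URL in the usage note are hypothetical.

```python
import json
import urllib.request
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, Dict[str, str]]  # (HBase row key, {field name: value})


def to_solr_doc(row: Row) -> Dict[str, str]:
    """Build a Solr document; assumes the unique key field is called 'id'."""
    row_key, columns = row
    doc = dict(columns)
    doc["id"] = row_key
    return doc


def chunked(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield consecutive batches of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def post_batch(update_url: str, docs: List[Dict[str, str]]) -> int:
    """POST a JSON array of documents to a Solr update handler URL."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_rows(rows: Iterable[Row], update_url: str, batch_size: int = 1000) -> int:
    """Send all rows to Solr in batches and return how many were sent."""
    sent = 0
    for batch in chunked(rows, batch_size):
        post_batch(update_url, [to_solr_doc(row) for row in batch])
        sent += len(batch)
    return sent
```

A call might look like `index_rows(rows, "http://solrhost:8983/solr/mycollection/update?commit=true")`, where the host and collection name are placeholders. Re-sending a row that is already indexed overwrites the existing document with the same `id`, so the load can be repeated safely.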