
Searching numeric data using Solr


I am using Solr for an (unusual?) use case: providing ranked results for numeric data.

  1. Say I have a record set O of objects {O1...On}, and for each of those objects I have multiple measurements, e.g. Viscosity, Porosity, Permeability, etc.

  2. For a new object On+1, I need to search the above record set to find the most "similar" objects along the multiple dimensions (Viscosity, Porosity, Permeability, etc.).

  3. Since the record set O holds hundreds of millions of records, it is practically impossible to compute a similarity metric such as Cosine or Minkowski against each one. I need to prune the result set to the top 100 or so candidates, and I'm using Solr to run a query for that.
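To make step 3 concrete, here is a minimal sketch of the second stage only: re-ranking a small, already pruned candidate set by Minkowski distance. The names `minkowski` and `rerank`, the object IDs, and the sample values are all made up for illustration, and the sketch assumes the measurements are on comparable scales (otherwise they would need normalizing first).

```python
def minkowski(a, b, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def rerank(query_vec, candidates, p=2, top_k=100):
    """Return candidate IDs ordered from most to least similar to query_vec.

    `candidates` maps an object ID to its measurement vector, e.g.
    (viscosity, porosity, permeability). It is expected to be the small,
    pruned set that came back from the index, not the whole record set.
    """
    scored = [(minkowski(query_vec, vec, p), obj_id)
              for obj_id, vec in candidates.items()]
    scored.sort()  # smallest distance first; ties fall back to the ID
    return [obj_id for _, obj_id in scored[:top_k]]


# Hypothetical candidates returned by the pruning query.
candidates = {
    "O1": (10.0, 2.0, 5.0),
    "O2": (10.1, 2.0, 5.0),
    "O3": (30.0, 9.0, 1.0),
}
print(rerank((10.1, 2.0, 5.1), candidates))  # ['O2', 'O1', 'O3']
```

With only about 100 candidates left, the cost of this step is negligible; the hard part is the pruning query itself.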

I run a range query built from the parameters of the On+1 object, e.g. Porosity in [9.5 TO 10.5] (+/-5% of a value of 10), and chain these ranges in a Boolean query to get a ranked list of matches.
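As a rough sketch of how such a query string could be assembled, the helper below builds one Solr range clause per measurement and joins them. The function name `build_range_query`, the lowercase field names, and the choice of `AND` as the default operator are assumptions; the description above does not say which Boolean operator is used.

```python
def build_range_query(measurements, tolerance=0.05, operator="AND"):
    """Build a Solr range query string with a +/- tolerance around each value.

    `measurements` maps a (hypothetical) numeric field name to the value
    measured for the query object.
    """
    clauses = []
    for field, value in measurements.items():
        delta = abs(value) * tolerance
        low, high = value - delta, value + delta
        # Solr inclusive range syntax: field:[low TO high]
        clauses.append(f"{field}:[{low:g} TO {high:g}]")
    return f" {operator} ".join(clauses)


query = build_range_query({"porosity": 10.0, "viscosity": 2.0})
print(query)  # porosity:[9.5 TO 10.5] AND viscosity:[1.9 TO 2.1]
```

Every document inside a range contributes the same amount for that clause, which is consistent with the step-like scores described in the first question below.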

My questions:

  1. Is there a better way of doing this and obtaining a score from Solr that I could use, perhaps for thresholding? The score from the current range query method seems to follow a step function, which is unhelpful.

  2. Could I persist the numbers in a text_general field and search using the query numbers? Since the query strings could get very long, I am unsure how to approach this; perhaps using MLT?

Any ideas, or suggestions for other toolkits that could help with the above?


Solution

  • Theory

    As you said, the range query is not useful for scoring here, but it's still a good way to filter the index first.

    Once the index is filtered (or not) by some base query, we can apply custom scoring.

    Here's a general example of how to implement custom scoring: http://spykem.blogspot.com/2013/06/plug-in-external-score-to-solr.html


    When implementing custom scoring, the CustomScoreProvider could take the following parameters:

    • Value step - step to lower the score
    • Score step - lower the score by this value whenever "value step" occurs
    • Max additional score - a "perfect match" gets this score in addition to its native score (from the regular search query); non-perfect matches get a lower (non-negative) value

    The additional score starts at "Max additional score" and is lowered by "Score step" each time the distance between the field value and the query value grows by "Value step", until it reaches zero.

    The additional score formula would look something like this (clamped so that it never drops below zero):

    max(0, Max additional score - ((|fieldValue - queryValue| / Value step) * Score step))
    

    Example

    So, for example, given the following settings:

    • Value step = 0.1
    • Score step = 0.01
    • Max additional score = 1

    with the following index values for some field (e.g. permeability):

    • 3 (for doc1)
    • 5 (for doc2)
    • 6 (for doc3)
    • 7 (for doc4)
    • 99999999 (for doc5)

    and if the initial search query looks like this (assuming a custom query parser registered under the hypothetical name nearestParser):

    q={!nearestParser valueStep=0.1 scoreStep=0.01 maxStep=1}permeability:5
    

    Then the results would be ranked as follows (assuming every doc has the same initial score of 1):

    • doc2 (with score - 2.0)
    • doc3 (with score - 1.9)
    • doc1 (with score - 1.8)
    • doc4 (with score - 1.8)
    • doc5 (with score - 1)
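The arithmetic behind these scores can be checked with a short sketch. It applies the formula as written (a linear penalty clamped at zero) outside of Solr; the function name `additional_score` is made up for illustration, and a strictly step-wise variant would floor the distance-to-step ratio instead.

```python
def additional_score(field_value, query_value, value_step, score_step, max_additional):
    """Proximity bonus: max_additional for an exact match, shrinking linearly
    with distance and clamped so that it never goes below zero."""
    steps = abs(field_value - query_value) / value_step
    return max(0.0, max_additional - steps * score_step)


# Index values from the example; an initial score of 1 is assumed for every doc.
docs = {"doc1": 3, "doc2": 5, "doc3": 6, "doc4": 7, "doc5": 99999999}
base_score = 1.0

scores = {
    name: base_score + additional_score(value, query_value=5, value_step=0.1,
                                        score_step=0.01, max_additional=1.0)
    for name, value in docs.items()
}

# Highest score first; sorted() is stable, so tied docs keep their input order.
ranked = sorted(scores.items(), key=lambda item: -item[1])
for name, score in ranked:
    print(f"{name}: {score:.1f}")
```

With these settings the bonus reaches zero once a value is 10 or more away from the query value, which is why doc5 is left with only its initial score.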

    Conclusion:

    • Doc2 will have the best score as it is a perfect match
    • Doc3 will be second, as it is the closest non-perfect match to the preferred input (and within scoring distance)
    • Doc1 and doc4 will have the same score, as they both have the same distance from the initial search query.
    • Doc5 will keep only its initial score, as it is too far from the query value to be considered "similar"

    I will try to come up with a practical example, but as that will take some time, I thought it would be better to answer with the idea for now.


    Other possible solution

    After reading about NumericRangeQuery, I also had an idea about using the Trie* field structure (specifically, leveraging its ability to handle numeric range searches efficiently) to find the nearest value in the index, but I haven't figured out how to do it yet.

    This could potentially be much more performant, though much more complicated, and there's still a chance that the Trie* structure cannot handle this sort of operation.