I am using Solr for a (perhaps unusual) use case: providing ranked results for numeric data.
Say I have a record-set of objects O {O1...On}, and for each of those objects I have multiple measurements, e.g. Viscosity, Porosity, Permeability etc.
For a new object On+1, I need to search the above record-set to find the most "similar" objects along the multiple dimensions (Viscosity, Porosity, Permeability etc.).
Since the record-set O contains hundreds of millions of records, it is practically impossible to run a similarity metric such as Cosine or Minkowski against each one. I need to prune the result-set to the top 100 or so candidates, and I'm using Solr to run a query for that.
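To make the second stage concrete, here is a minimal sketch of re-ranking the pruned candidates with a Minkowski distance. The names `minkowski`, `rerank` and the candidate layout (object id mapped to a tuple of measurements) are assumptions for illustration only, and the measurements would probably need to be normalized to comparable scales first.

```python
def minkowski(a, b, p=2.0):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def rerank(query_vec, candidates, p=2.0, top_k=100):
    """Return candidate ids ordered from most to least similar.

    candidates: dict of object id -> tuple of measurements
    (hypothetical layout, e.g. (viscosity, porosity, permeability)).
    """
    scored = [(minkowski(query_vec, vec, p), obj_id)
              for obj_id, vec in candidates.items()]
    scored.sort()  # smallest distance first
    return [obj_id for _, obj_id in scored[:top_k]]


candidates = {
    "O1": (1.0, 10.0, 100.0),
    "O2": (2.0, 10.0, 100.0),
    "O3": (1.0, 12.0, 100.0),
}
ranking = rerank((1.0, 10.0, 100.0), candidates)
```

With only around 100 candidates this step is cheap, so the expensive part remains the pruning query.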
I run a range query using the parameters of the On+1 object, e.g. Porosity in [9.5 TO 10.5] (so +/-5% of the value), and chain these ranges in a Boolean query to get a ranked list of matches.
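For illustration, the chained range query could be assembled roughly like this. This is only a sketch: the field names and the helper `build_range_query` are hypothetical, and it assumes every field uses the same relative tolerance.

```python
def build_range_query(measurements, tolerance=0.05):
    """Build a Solr range query such as 'porosity:[9.5 TO 10.5] AND ...'.

    measurements: dict of field name -> value of the new object
    (field names here are hypothetical).
    """
    clauses = []
    for field, value in measurements.items():
        delta = abs(value) * tolerance
        # Round to hide floating-point noise such as 9.499999999999998.
        lo = round(value - delta, 6)
        hi = round(value + delta, 6)
        clauses.append(f"{field}:[{lo} TO {hi}]")
    return " AND ".join(clauses)


query = build_range_query({"porosity": 10.0, "viscosity": 2.0})
```

The resulting string would be passed as the `q` (or `fq`) parameter of the Solr request.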
My questions:
Is there a better way of doing this and obtaining a score from Solr that I could use, perhaps to threshold? The score from the current range-query method seems to follow a step function and is unhelpful.
Could I persist the numbers in a text_general format and search using the query numbers? Since the query strings could run very long, I am unsure how to approach this; perhaps using MLT (MoreLikeThis)?
Any ideas, or suggestions for other toolkits that could help with the above?
As you said, the range query won't work for scoring here, but it's still a good way to filter the initial index.
Once the index is filtered (or not) with some base query, we can apply custom scoring.
Here's a general example of how to implement custom scoring: http://spykem.blogspot.com/2013/06/plug-in-external-score-to-solr.html
When implementing custom scoring, the CustomScoreProvider can receive parameters such as a "Value step", a "Score step" and a "Max additional score".
The additional score is lowered by "Score step" each time the distance between the field value and the query value grows by "Value step", starting from "Max additional score" and continuing until it reaches zero.
The additional scoring formula will look something like this (until it reaches zero):
Max additional score - ((|fieldValue - queryValue| / Value Step ) * Score Step)
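As a rough sketch of the formula above outside of Solr (the function and parameter names are made up for illustration; the real logic would live inside the CustomScoreProvider), with the result clamped so it never drops below zero:

```python
def additional_score(field_value, query_value,
                     value_step, score_step, max_additional_score):
    """Max additional score - ((|fieldValue - queryValue| / valueStep) * scoreStep),
    clamped at zero."""
    distance = abs(field_value - query_value)
    score = max_additional_score - (distance / value_step) * score_step
    return max(0.0, score)


# Settings taken from the example query in this answer:
# valueStep=0.1, scoreStep=0.01, max additional score 1, query value 5.
exact = additional_score(5.0, 5.0, 0.1, 0.01, 1.0)    # exact match keeps the full bonus
near = additional_score(4.5, 5.0, 0.1, 0.01, 1.0)     # about 0.95
far = additional_score(105.0, 5.0, 0.1, 0.01, 1.0)    # clamped to zero
```

Note that this version decreases continuously with distance, exactly as the formula is written; a strictly stepwise variant would floor the `distance / value_step` term first.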
So, for example, having following settings:
with following index values for some field (e.g. permeability):
and if the initial search query looks like this:
q={!nearestParser valueStep=0.1 scoreStep=0.01 maxStep=1}permeability:5
Then the result will look like (assuming the initial score is the same (1) for all docs)
Conclusion:
I will try to come up with a practical example, but as that will take some time, I thought it would be better to answer with the idea for now.
After reading about NumericRangeQuery, I also had the idea of using the Trie* field structure (specifically, leveraging its ability to handle numeric range searches efficiently) to find the nearest value in the index, but I haven't figured out how to do it yet.
This could potentially be much more performant, though much more complicated, and there's still a chance that the Trie* structure cannot handle this sort of operation.