Say I have the following collection of webpages in a Solr index:
+----+--------+--------------+-------------+
| ID | Domain | Path         | Content     |
+----+--------+--------------+-------------+
| 1  | 1.com  | /hello1.html | Hello dude  |
| 2  | 1.com  | /hello2.html | Hello man   |
| 3  | 1.com  | /hello3.html | Hello fella |
| 4  | 2.com  | /hello1.html | Hello sir   |
...
And I want a query for hello to show results grouped by domain, like:
Results from 1.com:
/hello1.html
/hello2.html
/hello3.html
Results from 2.com:
/hello1.html
How is ordering determined if I sort by score? I normally use a combination of TF/IDF and PageRank for my results, but since that calculates a score for each individual item, how does it determine how to order the groups? What if 1.com/hello2.html and 1.com/hello3.html both have very low relevance, but there are two of them, while 2.com/hello1.html has really high relevance but is the only result for its domain? Or vice versa? Or is relevance summed when there are multiple items in a grouping field?
I've looked around, but haven't been able to find a good answer to this.
Thanks.
It sounds to me like you are using Result Grouping. If that's the case, then the groups are sorted according to the sort parameter, and the records within each group are sorted according to the group.sort parameter. If you sort the groups by sort=score desc (this is the default, so you wouldn't actually need to specify it), then it sorts the groups according to the score of each group. How this score is determined isn't made very clear, but if you look through the examples in the linked documentation you can see this statement:
The groups are sorted by the score of the top document within each group.
So, in your example, if 2.com's hello1.html was the most relevant document in your result set, "Results from 2.com" would be your most relevant group even though "Results from 1.com" includes three times the document count.
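To make that rule concrete, here is a minimal Python sketch. The first part builds the request parameters for a grouped query; the host, the core name "pages", and the field name "domain" are assumptions, not something from your setup. The second part mimics the ordering rule on made-up scores. It is not Solr's code, only an illustration of "groups are sorted by the score of the top document within each group".

```python
from urllib.parse import urlencode

# Request parameters for Solr Result Grouping. The host, the core name
# "pages", and the field name "domain" are assumptions for illustration.
params = {
    "q": "hello",
    "group": "true",
    "group.field": "domain",
    "group.limit": 3,             # documents returned per group
    "sort": "score desc",         # orders the groups (the default)
    "group.sort": "score desc",   # orders the documents inside each group
    "fl": "id,domain,path,score",
}
url = "http://localhost:8983/solr/pages/select?" + urlencode(params)

# Made-up scores, only to demonstrate the ordering rule.
docs = [
    {"domain": "1.com", "path": "/hello1.html", "score": 0.40},
    {"domain": "1.com", "path": "/hello2.html", "score": 0.20},
    {"domain": "1.com", "path": "/hello3.html", "score": 0.10},
    {"domain": "2.com", "path": "/hello1.html", "score": 0.90},
]

def order_groups(docs):
    """Group by domain, then order groups by their best document's score."""
    groups = {}
    for doc in docs:
        groups.setdefault(doc["domain"], []).append(doc)
    # Within each group: highest score first (the group.sort step).
    for members in groups.values():
        members.sort(key=lambda d: d["score"], reverse=True)
    # Across groups: compare only the top document of each group (the sort step).
    return sorted(groups.items(), key=lambda kv: kv[1][0]["score"], reverse=True)

for domain, members in order_groups(docs):
    print(domain, [d["path"] for d in members])
```

With these invented scores, 2.com is listed first because its single document outscores every document from 1.com; the number of documents in a group plays no part.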
If this isn't what you want, your best options are to provide a different sort parameter or to post-process the results. For example, on one project I was involved in (where we had a very modest number of groups), we chose to pull the top three results for each group, and in post-processing we calculated our own sort order for the groups based on a combination of their scores and numFound values. This sort of strategy would likely be prohibitive with a large number of groups, and may not be a good idea if the more numerous groups run the risk of making the most relevant documents harder to find.