
"2d Search" in Solr or how to get the best item of the multivalued field 'items'?


The title is a bit awkward, but I couldn't find a better one. My problem is as follows:

I have several users stored as documents, and for each document I store several key-value pairs or items (each of which has an id). If I apply highlighting with hl.snippets=5, I can get the first 5 items. But every user could have several hundred items, so

  • you will not get the 5 most relevant items; you will get the first 5 items ...

Another problem is that

  • the highlighted text won't contain the id, so retrieving additional information about the highlighted item is awkward.

Example where items are emails:

user1 has item1 { text:"developers developers developers", id:1, title:"ms" }
          item2 { text:"c# development",                   id:2, title:"nice!" }
          ...
          item77 ...

user2 has item1 { text:"nice restaurant", id:3, title:"bla"}
          item2 { text:"best cafe",       id:4, title:"blup"}
          ...
          item223 ...

Now, if I use highlighting on the text field and query for "restaurant", I get user2 and the snippet nice <b>restaurant</b>. But how can I determine the id of the highlighted item, e.g. to display its title? And what happens if more relevant items sit near the end of the item list? Highlighting won't display those ...
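To make the setup concrete, here is a minimal sketch of the highlighting request described above. The Solr parameters (q, df, hl, hl.fl, hl.snippets, wt) are standard; the host, the core name `users` and the field name `text` are assumptions for illustration only.

```python
from urllib.parse import urlencode

# Hypothetical core URL; adjust host and core name to your own setup.
SOLR_USERS_URL = "http://localhost:8983/solr/users/select"

def highlight_query_url(query, snippets=5):
    """Build a select URL that asks for highlighted snippets of 'text'."""
    params = [
        ("q", query),
        ("df", "text"),             # default search field (assumed name)
        ("hl", "true"),             # switch highlighting on
        ("hl.fl", "text"),          # field to highlight (assumed name)
        ("hl.snippets", snippets),  # max snippets per field and document
        ("wt", "json"),
    ]
    return SOLR_USERS_URL + "?" + urlencode(params)

url = highlight_query_url("restaurant")
print(url)
```

As far as I can tell, the highlighting section of the response is keyed by the unique key of the matching document (here the user), not by anything that identifies the individual value of a multivalued field. That is exactly why the item id is lost.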

So how can I find the best items of a document with multiple such items?

I added my two findings as answers but, as I will point out, each of them has its own drawbacks.

Could anyone point me to a better solution?


Solution

  • You could use two indices: a users->items index as described in the question, and an index of 'pure items' that reference back to their user.

    Then you will need 2 queries (that's the reason I called the question '2d Search in Solr'):

    1. query the user index => list of e.g. 10 users
    2. query the items index for each user from step 1 => best items

    Assume the following example:

    userA's emails are "restaurant X is bad but restaurant X is cheap", "different topic", "different topicB" and

    userB's emails are "restaurant X is not nice", "revisited restaurant X and it was ok now", "again in restaurant X and I think it is the best".

    Now I query the user index for "restaurant X" and the first user will be userB, which is what I want. If I queried only the item index, the top hit would be item1 of the less relevant userA.
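The effect in this example can be reproduced with a small in-memory sketch. Note the assumptions: the two "indices" are plain Python structures, and `score` simply counts query-term occurrences as a stand-in for Solr's real relevance scoring, so only the ordering is meant to be illustrative.

```python
from collections import Counter

# Toy data taken from the example; item numbers are positions in the list.
USERS = {
    "userA": [
        "restaurant X is bad but restaurant X is cheap",
        "different topic",
        "different topicB",
    ],
    "userB": [
        "restaurant X is not nice",
        "revisited restaurant X and it was ok now",
        "again in restaurant X and I think it is the best",
    ],
}

def score(text, query):
    """Count query-term occurrences (NOT Solr's actual scoring)."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())

def rank_users(query):
    """'User index': one document per user, built from all their items."""
    scored = {u: score(" ".join(items), query) for u, items in USERS.items()}
    return sorted(scored, key=scored.get, reverse=True)

def rank_items(query, user=None):
    """'Item index': one document per item, optionally limited to one user."""
    hits = [
        (score(text, query), u, item_no)
        for u, items in USERS.items() if user in (None, u)
        for item_no, text in enumerate(items, start=1)
    ]
    return sorted(hits, key=lambda hit: -hit[0])

# Step 1: best users. Step 2: best items for each of those users.
for u in rank_users("restaurant X"):
    print(u, rank_items("restaurant X", user=u)[0])
```

With this toy scoring, `rank_users("restaurant X")` puts userB first, while `rank_items("restaurant X")` without a user filter puts item1 of userA on top, which is the mismatch described above.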

    Drawbacks:

    • bad performance, because you will need one query against the user index and e.g. 10 more to get the most relevant items for each user.
    • maintaining two indices.

    Update: to avoid the many queries, I will try the following: use the user index to get some highlighted snippets, and then offer a 'get relevant items' button for every user, which triggers a query against the item index.
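A possible shape for that on-demand item query is sketched below. The parameters q, df, fq, fl, rows and wt are standard Solr request parameters; the core name `items` and the field names `user_id`, `id`, `title` and `text` are hypothetical and depend on how the item index is modelled.

```python
from urllib.parse import urlencode

# Hypothetical item core; adjust host and core name to your own setup.
SOLR_ITEMS_URL = "http://localhost:8983/solr/items/select"

def items_for_user_url(user_id, query, rows=5):
    """Build the query fired when 'get relevant items' is clicked."""
    params = [
        ("q", query),
        ("df", "text"),                 # default search field (assumed name)
        ("fq", f"user_id:{user_id}"),   # restrict hits to this user's items
        ("fl", "id,title,text,score"),  # item id and title come back directly
        ("rows", rows),                 # only the best few items
        ("wt", "json"),
    ]
    return SOLR_ITEMS_URL + "?" + urlencode(params)

print(items_for_user_url("user2", "restaurant"))
```

Because each item is its own document in this index, the result is sorted by relevance and carries the item id, so both problems from the question disappear for the one user that was clicked.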