Is there a way of performing a weighted elementSimilarity with Vespa?


I'm having trouble dealing with multivalue query items and fields in terms of element similarity. For example, suppose we have an array of strings like this:

field colors type array<string>
# That might have several items like: "blue", "black and purple", "green", "yellow", etc

And I wish to query with a list of items:

"blue" (weight 0.5), "black" (weight 1.0)

Is there a way to perform a weighted listwise similarity that might look like: weight * elementSimilarity(blue on colors) + weight * elementSimilarity(black on colors)?

I've tried multiple features, including nativeRank, but I get inconsistent results depending on the length of both the query array and the field array. I also want to be able to deal with misspellings ("blu" should have a very high match with "blue"), which is why I prefer elementSimilarity. I think I've tried most of the rank features in Vespa, but I haven't found a better way to deal with this use case.

Any guidance would be much appreciated! Thanks!

Edit: Just to elaborate, perhaps the biggest restriction to me in Vespa is how arrays are handled in the query. I would very much like to do something like:

expression {
    foreach(terms,N,query(colors,N).weight*elementSimilarity(query(colors,N)),true,sum)
}

Solution

  • There are many ways to accomplish this, but which is best depends on whether you need free-text-style matching (linguistic processing of the string, including tokenization and stemming). It also depends on whether this is just a ranking signal for documents that are already retrieved, or is also used to retrieve documents.

    If you don't need free-text-style matching and can instead use exact matching without linguistic processing (e.g., with a fixed vocabulary), and this color ranking is just another ranking signal, consider using tensor ranking. Tensors are useful for ranking documents that are retrieved by the query operators; you cannot retrieve using a tensor (except for dense single-order tensors using approximate nearest neighbor search). See the tensor guide: https://docs.vespa.ai/en/tensor-user-guide.html.
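
To make the exact-match trade-off concrete, here is a minimal Python sketch of what a sparse tensor dot product computes. It is only an illustration, not Vespa code: the helper name `sparse_dot` and the float weights on the document side are assumptions, and the query weights are taken from the question.

```python
def sparse_dot(query_weights: dict[str, float], doc_weights: dict[str, float]) -> float:
    """Sum of weight products over labels that appear in both sparse vectors."""
    return sum(
        weight * doc_weights[label]
        for label, weight in query_weights.items()
        if label in doc_weights
    )

# Query weights from the question; document weights are assumed to be 1.0.
query_colors = {"blue": 0.5, "black": 1.0}
doc_colors = {"blue": 1.0, "black and purple": 1.0, "green": 1.0, "yellow": 1.0}

# Only "blue" is an exact label match, so "black" contributes nothing.
score = sparse_dot(query_colors, doc_colors)
print(score)
```

Float weights work here, but labels must match exactly: neither "black" against "black and purple" nor "blu" against "blue" contributes to the score.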

    If you need free-text-style matching, there are also several approaches. The example below assumes that you want text-style matching and that a query term 'purple' should match a document containing 'black and purple'. See the matching documentation: https://docs.vespa.ai/en/reference/schema-reference.html#match

    If you define the field colors like this

    field colors type weightedset<string>{
       indexing: summary | index
       match: text #This is default matching for string fields with 'index'
    }
    

    And feed a doc

    "colors": {
       "blue":1,
       "black and purple":1, 
       "green": 1,
       "yellow": 1
    }
    

    You can retrieve and rank using the following query

    {
    "yql": "select * from sources * where colors contains ([{\"weight\":1}]\"purple\") or colors contains ([{\"weight\":2}]\"yellow\");",
    "ranking.profile": "color-ranking"
    }
    

    See query language reference on term weights

    There are multiple ways you can rank the retrieved documents, but the example below assumes you use color ranking as the only ranking signal.

    rank-profile color-ranking {
      function colorMatch() {
         expression: nativeDotProduct(colors)
      }
      first-phase {
       expression: colorMatch()
      } 
    }
    

    Here we use the nativeDotProduct ranking feature, which in our example will return 3 (2*1 + 1*1). The term weight and document weight can only be integers; tensors allow floats.
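
The arithmetic behind that score can be checked with a small Python model. This is a toy sketch rather than Vespa's implementation: the function name `dot_product_over_matches` is hypothetical, and lowercased whitespace splitting stands in for Vespa's linguistic processing.

```python
def dot_product_over_matches(query_terms: dict[str, int], doc_elements: dict[str, int]) -> int:
    """Add term_weight * element_weight for every element containing the term as a token."""
    score = 0
    for term, term_weight in query_terms.items():
        for element, element_weight in doc_elements.items():
            # Assumption: whitespace tokenization approximates 'match: text'.
            if term in element.lower().split():
                score += term_weight * element_weight
    return score

# The document and query weights from the example above.
doc = {"blue": 1, "black and purple": 1, "green": 1, "yellow": 1}
query = {"purple": 1, "yellow": 2}

total = dot_product_over_matches(query, doc)
print(total)  # purple: 1*1, yellow: 2*1
```

With these inputs the model yields 3, matching the figure quoted for nativeDotProduct.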

    The elementSimilarity ranking feature is also a candidate and allows more flexibility: you can override whether max or sum is used, and how the element weight and the query term weight are combined.

    If this is only a ranking signal, you can also use the rank query operator:

    {
    "yql": "select * from sources * where rank(foo contains \"bar\", colors contains ([{\"weight\":1}]\"purple\") or colors contains ([{\"weight\":2}]\"yellow\"));",
    "ranking.profile": "color-ranking"
    }
    

    In the above query we retrieve documents where a field called 'foo' contains 'bar', and for those documents the colors field is matched and ranking features are created (depending on which features are used in the ranking profile).
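
Hand-escaping the nested quotes in such a request body is error-prone, so one option is to build it programmatically. The Python sketch below is a suggestion under stated assumptions: the helper names `weighted_contains` and `build_rank_query` are hypothetical, terms are assumed to contain no double quotes, and only the JSON body is produced (no request is sent).

```python
import json

def weighted_contains(field: str, term: str, weight: int) -> str:
    """Render one weighted YQL term, e.g. colors contains ([{"weight":1}]"purple")."""
    annotation = json.dumps([{"weight": weight}], separators=(",", ":"))
    # Assumption: the term contains no double quotes that would need escaping.
    return f'{field} contains ({annotation}"{term}")'

def build_rank_query(filter_clause: str, color_weights: dict[str, int], profile: str) -> str:
    """Build the JSON body; json.dumps takes care of escaping the inner quotes."""
    color_clauses = " or ".join(
        weighted_contains("colors", term, weight) for term, weight in color_weights.items()
    )
    yql = f"select * from sources * where rank({filter_clause}, {color_clauses});"
    return json.dumps({"yql": yql, "ranking.profile": profile}, indent=2)

body = build_rank_query('foo contains "bar"', {"purple": 1, "yellow": 2}, "color-ranking")
print(body)
```

The printed body is valid JSON with every inner quote escaped, and parsing it back yields the same YQL string as the query above.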

    Generally, the query expresses how to retrieve documents, and the ranking profile determines how those retrieved documents are ranked. The rank query operator is a convenient way to create matching (Q-D interaction) ranking features without impacting recall.

    There are also other, more efficient approaches, including the wand query operator, if you want to retrieve efficiently using the inner dot product between something in the query and something in the document. See https://docs.vespa.ai/en/using-wand-with-vespa.html