multilabel-classification keyword-search vespa

What's the most efficient way to include Vespa document keywords in ranking at query time?

We have a situation where we want to search text against category documents, which are enriched by a keywords field. These keywords are terms and phrases curated by subject matter experts and GPT.

We want to be able to run queries ranging in length from one word to a medium-sized paragraph and get back the most suitable category, based primarily on the keywords field.

We have the following test setup in our schema:

document category {

    field id type int {
        indexing: summary | attribute
    }

    field title type string {
        indexing: summary | attribute | index
        index: enable-bm25
    }

    field keywords type array<string> {
        indexing: summary | index
    }
}

field title_embedding type tensor<bfloat16>(x[384]) {
    indexing: input title | embed bert | attribute | index
    attribute {
        distance-metric: angular
    }
}

fieldset default {
    fields: title
}

We have tried the following profile:

rank-profile bm25_semantic inherits default {
    inputs {
        query(query_embedding) tensor<bfloat16>(x[384])
    }
    first-phase {
        expression: bm25(title) + matches(keywords) + closeness(field, title_embedding)
    }
}

Together with the following query:

SELECT * FROM category WHERE userQuery() OR rank(keywords contains 'X' OR keywords contains 'Y') OR ({targetHits: 100}nearestNeighbor(title_embedding,query_embedding))

We have been able to get decent results with this configuration, but it's not scalable because:

- The keywords contains 'X' argument may need to be repeated many times, because 'X' and 'Y' stand for tokens in the query. For example, query="Here is a sample text about mice and mosquitos" would give: "...OR keywords contains 'mice' OR keywords contains 'mosquitos'..."
- The keywords and the query can be in different languages, some of which (e.g. Chinese) do not work well with basic tokenisation.

Essentially we're looking for a solution which is the inverse of the contains argument. So instead of keywords contains "text", we need something like "text" contains keywords where the keywords are ideally included in the index.

We are still fairly new to Vespa, so we're not sure of the proper approach to handling keywords. Is a different data/field structure required to handle this, or can it be done by building out a dedicated rank profile?

Any help would be appreciated!

Solution

"we're looking for a solution which is the inverse of the contains argument."

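This operator is called weightedSet. It takes the field and a map from token to weight, so the whole list of query tokens goes into a single clause: where weightedSet(keywords, {"mice":1, "mosquitos":1})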
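
As a rough illustration of how the clause could be generated client-side rather than written by hand, here is a minimal Python sketch. The helper name build_weighted_set_yql, the whitespace/punctuation tokenization, and the uniform weight of 1 are all assumptions made for this example, not part of Vespa. Stop-word removal and escaping of unusual characters are left out.

```python
import json
import re


def build_weighted_set_yql(text, field="keywords", source="category"):
    """Build a YQL string that matches `field` against every token in `text`.

    Hypothetical helper: the tokenization below is a naive assumption and
    will not segment languages written without spaces (e.g. Chinese).
    """
    # Lowercase and split on runs of non-word characters.
    tokens = [t for t in re.split(r"\W+", text.lower()) if t]
    if not tokens:
        raise ValueError("query text contains no tokens")
    # Give every distinct token weight 1; dict.fromkeys keeps first-seen order.
    weights = dict.fromkeys(tokens, 1)
    # json.dumps yields the {"token": weight} map syntax shown in the answer.
    token_map = json.dumps(weights, ensure_ascii=False)
    return f"select * from {source} where weightedSet({field}, {token_map})"


yql = build_weighted_set_yql("mice and mosquitos")
print(yql)
```

The resulting string would be sent as the yql parameter of the query request. The point is that the clause stays a single operator no matter how many tokens the query text contains.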

On queries and keywords in different languages: no tokenization is done on the individual tokens you pass in YQL. For that, you need to pass in the raw text with userQuery, and you can then control how it is parsed by setting a grammar parameter.
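
One possible reading of this advice, sketched in Python: send the raw text as a request parameter and let Vespa parse it, here via the userInput function with a grammar annotation. The helper name build_user_input_request, the parameter name q, and the choice of grammar "any" with keywords as the default index are assumptions for illustration. Which grammar suits the data is something to check against the Vespa query language reference.

```python
def build_user_input_request(text, grammar="any", field="keywords"):
    """Build a query request body that hands raw text to Vespa for parsing.

    Hypothetical helper: the text is not tokenized here; Vespa's own
    linguistic processing handles it, guided by the grammar annotation.
    """
    yql = (
        "select * from category where "
        # Doubled braces emit literal { } around the YQL annotations.
        f'{{defaultIndex: "{field}", grammar: "{grammar}"}}userInput(@q)'
    )
    # @q in the YQL refers to the request parameter named "q".
    return {"yql": yql, "q": text}


body = build_user_input_request("Here is a sample text about mice and mosquitos")
print(body["yql"])
```

The returned dictionary is meant to be posted as the JSON body of a search request. Because the text travels as a separate parameter, it needs no quoting or escaping inside the YQL string.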