How to apply custom analyzers to a field in a Vespa schema

I have the following schema:

schema product {
    document product {
        field brand_name type string {
            indexing: summary | index
        }
        field brand_name_tokens type array<string> {    #computed field
            ....
        }
    }
}

I want the document to have a field called brand_name_tokens, of type array<string>, that is derived from the brand_name field as follows:

Split brand_name on whitespace, then remove any ™ or ® characters ([™®]) from the resulting tokens.

I can do this processing before writing to Vespa, but I would like to know if it's possible to define it in the schema so that Vespa computes the field automatically.

Solution

You can do this with the indexing language's split expression, which splits a string by a regex. There's no "remove substring" expression, but if you don't mind also splitting on ™ and ®, a slightly fancier regex does the job.

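It looks a little different because you can't input Unicode characters directly in the schema, so ™ has to be written as its UTF-8 byte sequence, \xe2\x84\xa2. Note that the character class in this pattern also treats "." as a separator; remove it if you don't want to split on periods: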

field brand_name_tokens type array<string> {
    indexing: input brand_name | split "([. ®]|\xe2\x84\xa2)+" | summary
}

Because this is a computed field, define it outside the document product { ... } block, directly inside the schema.
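
Put together, the complete schema might look roughly like the sketch below. It only combines the schema from the question with the field definition above; the comments are added for illustration, and the regex is unchanged, so check the result against your own Vespa version and data before relying on it.

```
schema product {
    document product {
        # Regular document field, fed by the client
        field brand_name type string {
            indexing: summary | index
        }
    }

    # Computed field: declared inside the schema but outside the document block.
    # Its value is derived from brand_name at indexing time.
    field brand_name_tokens type array<string> {
        indexing: input brand_name | split "([. ®]|\xe2\x84\xa2)+" | summary
    }
}
```

Because only summary is specified here, the tokens are returned in results but are not searchable; that matches the answer as given, and whether you also need index or attribute depends on how you intend to query the field.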

See https://docs.vespa.ai/en/reference/indexing-language-reference.html#split and the rest of that page for what you can do at indexing time. If you need to do more, you can write a custom Document Processor.