I have the following schema:
schema product {
    document product {
        field brand_name type string {
            indexing: summary | index
        }
        field brand_name_tokens type array<string> { # computed field
            ....
        }
    }
}
I want the document to have a field called brand_name_tokens of type array<string>, derived from the brand_name field as follows: split brand_name by whitespace, then remove any of the characters [™®] if present. I can do this processing before writing to Vespa, but I would like to know whether it's possible to define it in the schema so that Vespa computes the field automatically.
You can do this: the indexing language has a "split by regex" operation. There is no "remove substring" operation, but if you don't mind splitting on ™/® as well, you can do it with a slightly fancier regex.
It looks a little different because you can't input Unicode characters directly in the schema, so you have to replace ™ with \xe2\x84\xa2 (its UTF-8 byte sequence):
field brand_name_tokens type array<string> {
indexing: input brand_name | split "([. ®]|\xe2\x84\xa2)+" | summary
}
As this is a computed field, it should be defined outside the document product { } block (but still inside schema product).
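Putting the pieces together, the whole schema might look roughly like the sketch below. It is assembled only from the question's schema and the field definition above and has not been tested against a running Vespa instance; the comments are added for illustration.

```
schema product {

    document product {
        # Fed as part of the document
        field brand_name type string {
            indexing: summary | index
        }
    }

    # Computed at indexing time from brand_name, so it is declared
    # outside the document block
    field brand_name_tokens type array<string> {
        indexing: input brand_name | split "([. ®]|\xe2\x84\xa2)+" | summary
    }
}
```

With only summary in the indexing statement, the tokens are returned in search results; if they also need to be searchable or filterable, the statement would presumably need index or attribute as well, which is an assumption beyond what was asked here.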
See https://docs.vespa.ai/en/reference/indexing-language-reference.html#split and the rest of that page for what you can do at indexing time. If you need to do more, you can write a custom Document Processor.