I'm trying to index French words in multiple forms as synonyms. For example, l'ami should be indexed as is, plus two synonyms, "lami" and "l ami", so the synonym graph for this word would look something like:
---l---ami--
|          |
---l'ami----
|          |
---lami-----
One could use the Conditional token filter to check if an apostrophe (I'll normalize all apostrophe types with a char filter beforehand) is present in the word and apply a synonym or some kind of filter if this is the case.
Is there a way to dynamically add synonyms at index/query time, on the condition that a certain character is found in the string?
The multiplexer token filter is what you need. It runs each token through several filters and emits every resulting version at the same position.
Mapping that combines the condition filter and the multiplexer:
PUT /dynamic_synonyms
{
  "settings": {
    "analysis": {
      "analyzer": {
        "dynamic_synonym_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "elision_detect_filter"
          ]
        }
      },
      "filter": {
        "dynamic_synonym_filter": {
          "type": "multiplexer",
          "filters": [
            "apostroph_remove_filter",
            "lowercase",
            "apostroph_space_replace_filter"
          ]
        },
        "apostroph_space_replace_filter": {
          "type": "pattern_replace",
          "pattern": "'",
          "replacement": " "
        },
        "apostroph_remove_filter": {
          "type": "pattern_replace",
          "pattern": "'",
          "replacement": ""
        },
        "elision_detect_filter": {
          "type": "condition",
          "filter": [
            "dynamic_synonym_filter"
          ],
          "script": {
            "source": """token.term.toString().startsWith('l\'')"""
          }
        }
      }
    }
  }
}
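The lowercase filter inside dynamic_synonym_filter is a no-op here: the analyzer has already lowercased the token, so that branch just reproduces the original term, and the multiplexer removes duplicate tokens at the same position.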
Analyzing a string
POST /dynamic_synonyms/_analyze
{
  "analyzer": "dynamic_synonym_analyzer",
  "text": "l'ami bon"
}
Response
{
  "tokens": [
    {
      "token": "l'ami",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "lami",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "l ami",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "bon",
      "start_offset": 6,
      "end_offset": 9,
      "type": "word",
      "position": 1
    }
  ]
}
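To make the token flow easier to follow, here is a rough Python sketch that imitates what the analyzer above does to each token. It is not Elasticsearch code: the function names (`whitespace_tokenize`, `expand_elision`) and the `(term, position)` tuple are made up for illustration, and offsets are left out. It assumes the multiplexer keeps the original token and drops duplicates at the same position.

```python
from typing import List, Tuple

# Hypothetical token shape for this sketch: (term, position).
Token = Tuple[str, int]


def whitespace_tokenize(text: str) -> List[Token]:
    """Imitate the whitespace tokenizer: one token per whitespace-separated chunk."""
    return [(term, pos) for pos, term in enumerate(text.split())]


def expand_elision(tokens: List[Token], prefix: str = "l'") -> List[Token]:
    """Imitate lowercase -> condition -> multiplexer from the mapping above."""
    out: List[Token] = []
    for term, pos in tokens:
        term = term.lower()  # analyzer-level lowercase filter
        if not term.startswith(prefix):
            # Condition script is false: the token passes through unchanged.
            out.append((term, pos))
            continue
        # Condition is true: the multiplexer emits one version per branch,
        # plus the original token, all at the same position.
        variants = [
            term,                    # original token is preserved
            term.replace("'", ""),   # apostroph_remove_filter
            term.lower(),            # lowercase branch (no-op here)
            term.replace("'", " "),  # apostroph_space_replace_filter
        ]
        seen = set()
        for variant in variants:
            if variant not in seen:  # duplicates at the same position are dropped
                seen.add(variant)
                out.append((variant, pos))
    return out


if __name__ == "__main__":
    for token in expand_elision(whitespace_tokenize("l'ami bon")):
        print(token)
```

Run on "l'ami bon", the sketch produces the same terms and positions as the response above: three variants at position 0 and "bon" at position 1. Note that, like the condition script, it only reacts to tokens starting with l'; other elisions would need a broader check.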