Tags: elasticsearch, lucene

Elasticsearch add synonym at index time based on token content


I'm trying to index French words in multiple forms as synonyms. For example, l'ami should be indexed as is, plus two synonyms, "lami" and "l ami", so the synonym graph for this word would look something like:

---l---ami--
|          |
---l'ami----
|          |
---lami-----

One could use the conditional token filter to check whether the word contains an apostrophe (I'll normalize all apostrophe types with a char filter beforehand) and, if it does, apply a synonym filter or some other filter.

Is there a way to dynamically add synonyms at index/query time based on the condition that a certain character is found in the string?


Solution

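  • The multiplexer filter does this. It runs each token through several filters and emits every result at the same position, so the variants behave like synonyms. Wrapping it in a condition filter applies it only to tokens that match a script.

    Mapping with the condition filter and the multiplexer: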

    PUT /dynamic_synonyms
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "dynamic_synonym_analyzer": {
                        "tokenizer": "whitespace",
                        "filter": [
                            "lowercase",
                            "elision_detect_filter"
                        ]
                    }
                },
                "filter": {
                    "dynamic_synonym_filter": {
                        "type": "multiplexer",
                        "filters": [
                            "apostroph_remove_filter",
                            "lowercase",
                            "apostroph_space_replace_filter"
                        ]
                    },
                    "apostroph_space_replace_filter": {
                        "type": "pattern_replace",
                        "pattern": "'",
                        "replacement": " "
                    },
                    "apostroph_remove_filter": {
                        "type": "pattern_replace",
                        "pattern": "'",
                        "replacement": ""
                    },
                    "elision_detect_filter": {
                        "type": "condition",
                        "filter": [
                            "dynamic_synonym_filter"
                        ],
                        "script": {
                            "source": """token.term.toString().startsWith('l\'')"""
                        }
                    }
                }
            }
        }
    }
    

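    The lowercase filter inside dynamic_synonym_filter is effectively a no-op: the analyzer has already lowercased the token, so it only reproduces the original token, and the multiplexer drops duplicate tokens at the same position.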

    Analyzing a string:

    POST /dynamic_synonyms/_analyze
    {
        "analyzer" : "dynamic_synonym_analyzer",
        "text" : "l'ami bon"
    }
    

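    Response (note that "l ami" comes back as a single token containing a space, not as two separate tokens "l" and "ami"):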

    {
        "tokens" : [
            {
                "token" : "l'ami",
                "start_offset" : 0,
                "end_offset" : 5,
                "type" : "word",
                "position" : 0
            },
            {
                "token" : "lami",
                "start_offset" : 0,
                "end_offset" : 5,
                "type" : "word",
                "position" : 0
            },
            {
                "token" : "l ami",
                "start_offset" : 0,
                "end_offset" : 5,
                "type" : "word",
                "position" : 0
            },
            {
                "token" : "bon",
                "start_offset" : 6,
                "end_offset" : 9,
                "type" : "word",
                "position" : 1
            }
        ]
    }
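
To make the token flow easier to follow, here is a minimal plain-Python sketch of what the analyzer above appears to do. It is only a model, not Elasticsearch code: the function names `multiplexer` and `analyze` are hypothetical, and it assumes the multiplexer keeps the original token and removes duplicates at the same position, which matches the response shown above.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, int]  # (term, position)


def multiplexer(term: str, filters: List[Callable[[str], str]]) -> List[str]:
    """Model of a multiplexer: original term plus one variant per filter,
    with duplicates dropped (assumed behaviour, see lead-in)."""
    variants = [term]
    for token_filter in filters:
        variant = token_filter(term)
        if variant not in variants:
            variants.append(variant)
    return variants


def analyze(text: str) -> List[Token]:
    """Model of dynamic_synonym_analyzer: whitespace tokenizer, lowercase,
    then the conditional multiplexer for tokens starting with l'."""
    # Stand-ins for apostroph_remove_filter and apostroph_space_replace_filter.
    remove_apostrophe = lambda t: t.replace("'", "")
    apostrophe_to_space = lambda t: t.replace("'", " ")

    tokens: List[Token] = []
    for position, raw in enumerate(text.split()):
        term = raw.lower()
        if term.startswith("l'"):  # same test as the condition script
            variants = multiplexer(
                term, [remove_apostrophe, str.lower, apostrophe_to_space]
            )
        else:
            variants = [term]
        # Every variant shares the position of the original token.
        tokens.extend((variant, position) for variant in variants)
    return tokens


print(analyze("l'ami bon"))
# [("l'ami", 0), ('lami', 0), ('l ami', 0), ('bon', 1)]
```

In this model, `str.lower` contributes nothing for an already lowercased token, which is why the inner lowercase filter can be treated as a no-op. To cover other elisions (d', j', qu', and so on), the condition would have to test for the apostrophe itself rather than the `l'` prefix.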