
Elasticsearch: how does the synonym token filter work when a synonym is multi-word?


Can somebody please explain how the synonym token filter works when a synonym is a multi-word expression and the tokenizer is whitespace? For example, given these simple index settings:

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "whitespace",
                        "filter" : ["synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "lenient": true,
                        "synonyms" : ["multi word, bar => baz"]
                    }
                }
            }
        }
    }
}

I don't understand how the term "multi word" can be evaluated if the whitespace tokenizer breaks it into two tokens, "multi" and "word". As I understand it, the synonym filter never receives "multi word" as a single term to look up in the synonym configuration. Any help is appreciated.


Solution

  • The answer can be found in this section

    https://www.elastic.co/guide/en/elasticsearch/reference/7.6/token-graphs.html

    and this blog post

    http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

    Some token filters can add tokens that span multiple positions. These can include tokens for multi-word synonyms, such as using "atm" as a synonym for "automatic teller machine". However, only some token filters, known as graph token filters, accurately record the positionLength for multi-position tokens.

    Indexing ignores the positionLength attribute and does not support token graphs containing multi-position tokens. However, queries, such as the match or match_phrase query, can use these graphs to generate multiple sub-queries from a single query string.

    The following token filters can add tokens that span multiple positions but only record a default positionLength of 1:
    
    - synonym
    - word_delimiter
    
    This means these filters will produce invalid token graphs for streams containing such tokens.
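To make the positionLength point concrete, here is a minimal Python sketch that models a token stream as a graph and walks every path through it. It is not Elasticsearch code: the `Token` tuple, the `paths` helper, and the sample text "automatic teller machine nearby" are assumptions made for illustration, reusing the "atm" example from the quoted passage.

```python
from collections import namedtuple

# Hypothetical model of a token: term, start position, and how many positions it spans.
Token = namedtuple("Token", ["term", "position", "position_length"])


def paths(tokens):
    """Return every term sequence that walks the graph from position 0 to its end."""
    by_start = {}
    for tok in tokens:
        by_start.setdefault(tok.position, []).append(tok)
    end = max(t.position + t.position_length for t in tokens)

    def walk(pos):
        if pos == end:
            return [[]]
        found = []
        for tok in by_start.get(pos, []):
            # A token leads to the position where it ends.
            for tail in walk(pos + tok.position_length):
                found.append([tok.term] + tail)
        return found

    return [" ".join(p) for p in walk(0)]


original = [
    Token("automatic", 0, 1),
    Token("teller", 1, 1),
    Token("machine", 2, 1),
    Token("nearby", 3, 1),
]

# Graph-aware output: "atm" starts at position 0 and spans three positions.
valid_graph = original + [Token("atm", 0, 3)]

# Flat output: the same token, but recorded with the default positionLength of 1.
flat_graph = original + [Token("atm", 0, 1)]

print(paths(valid_graph))  # ['automatic teller machine nearby', 'atm nearby']
print(paths(flat_graph))   # ['automatic teller machine nearby', 'atm teller machine nearby']
```

With the correct span, the two paths are the two spellings of the same text. With a positionLength of 1, "atm" appears to be followed by "teller", so the second path corresponds to neither spelling, which is the kind of invalid graph the passage describes.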
    
    Avoid using invalid token graphs for search. Invalid graphs can cause unexpected search results.
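As for the original question: the filter does not need to receive "multi word" as one term. Elasticsearch parses each synonym rule with the tokenizer and the token filters that precede the synonym filter in the analyzer, so the left-hand side "multi word" is itself stored as the two-token sequence [multi, word]. At analysis time the filter looks ahead in the token stream and matches that sequence of consecutive tokens. The Python sketch below imitates this idea for the rule from the question. It is a simplification under stated assumptions: Lucene uses a finite-state transducer rather than a dictionary, and the helper names (`whitespace_tokenize`, `parse_rules`, `apply_synonyms`) are hypothetical.

```python
def whitespace_tokenize(text):
    """Stand-in for the whitespace tokenizer: split on runs of whitespace."""
    return text.split()


def parse_rules(rules):
    """Parse explicit rules such as 'multi word, bar => baz'.

    Each left-hand alternative is tokenized like the input text, so a
    multi-word alternative becomes a tuple of several tokens.
    """
    table = {}
    for rule in rules:
        lhs, rhs = rule.split("=>")
        replacement = whitespace_tokenize(rhs)
        for alternative in lhs.split(","):
            table[tuple(whitespace_tokenize(alternative))] = replacement
    return table


def apply_synonyms(tokens, table):
    """Replace token sequences found in the table, trying the longest match first."""
    max_len = max((len(key) for key in table), default=1)
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])  # look ahead over the next n tokens
            if key in table:
                out.extend(table[key])
                i += n
                break
        else:
            # No rule starts here: pass the token through unchanged.
            out.append(tokens[i])
            i += 1
    return out


TABLE = parse_rules(["multi word, bar => baz"])

print(apply_synonyms(whitespace_tokenize("a multi word query"), TABLE))  # ['a', 'baz', 'query']
print(apply_synonyms(whitespace_tokenize("foo bar"), TABLE))             # ['foo', 'baz']
print(apply_synonyms(whitespace_tokenize("multi purpose"), TABLE))       # ['multi', 'purpose']
```

The last call shows that "multi" on its own is left untouched: the rule fires only when the tokens "multi" and "word" arrive one after the other.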