Tags: elasticsearch, tokenize, elasticsearch-analyzers

Elasticsearch: implement an off-the-shelf language analyser but use a custom tokeniser


This may be a duplicate but I've done a bit of searching and found no answer.

I have a simple requirement: I want to use the French analyser (for example) and tweak it slightly so that it recognises a dot (".") as a token separator.

Due to the endlessly baffling nature of the ES documentation, I just can't work out how to do this simple thing: I have managed to devise and apply such a tokeniser ... but then the stop words stop being applied.

The trouble with the relevant ES documentation pages is that they immediately launch into how to implement a fully re-engineered language analyser (giving many examples) but don't explain how to apply a relatively simple "tweak" to a regular language analyser, leaving all other things untouched.

My attempt so far:

SETTINGS = \
{
    "settings": {
        "analysis": {
            "analyzer": {
                "french": {
                    "tokenizer": "dot_and_boundary_tokeniser",
                    "filter": ["lowercase", "stemmer", "stop"]    
                }
            },
            "tokenizer": {
                "dot_and_boundary_tokeniser": {
                    "type": "pattern",
                    "pattern": r"[\.\W]+" # this pattern means that token separation will also occur with a "."
                }
            }
        }
    }
}        

and then:

MAPPINGS = \
{
    "properties": {
        "french_normalised_content": {
            "type": "text",
            "term_vector": "with_positions_offsets",
            "fields": {
                "french_stemmed": {
                    "type": "text",
                    "analyzer": "french",
                    "term_vector": "with_positions_offsets",
                }
            }
        }
    }
}

... when I then search using the "french_normalised_content.french_stemmed" field, I find that words separated by "." are correctly divided ... but stop words such as "ou" or "le" in the query string are (wrongly) not being disregarded, and the highlighter highlights those words in the results.
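
The behaviour is easy to reproduce with the _analyze API (the index name and sample text here are just placeholders):

GET /my_french_index/_analyze
{
  "analyzer": "french",
  "text": "le chat.noir ou le chien"
}

The "." is split on as expected, but tokens such as "le" and "ou" are still present in the output.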

But here's the most baffling thing of all: my English analyser is configured in exactly the same way... Dots function as token separators. And the stop words are (correctly) ignored.

I'm not quite clear: do my settings as above mean that I am completely "re-inventing" the French analyser? My preference would be just to get hold of the regular French analyser and inject my new tokeniser into it. Is this possible?


Solution

  • do my settings as above mean that I am completely "re-inventing" the French analyser?

    Yes.

  • My preference would be just to get hold of the regular French analyser and inject my new tokeniser into it. Is this possible?

    You need to redefine the analyzer from scratch; you cannot "inherit" from an existing built-in analyzer.

    Fortunately, the french analyzer documentation contains a full definition that would allow you to quickly recreate it and then modify the tokenizer, which in your case would result in something like this:

    PUT /french_example
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "dot_and_boundary_tokenizer": {
              "type": "pattern",
              "pattern": "[\\.\\W]+"
            }
          },
          "filter": {
            "french_elision": {
              "type":         "elision",
              "articles_case": true,
              "articles": [
                  "l", "m", "t", "qu", "n", "s",
                  "j", "d", "c", "jusqu", "quoiqu",
                  "lorsqu", "puisqu"
                ]
            },
            "french_stop": {
              "type":       "stop",
              "stopwords":  "_french_" 
            },
            "french_keywords": {
              "type":       "keyword_marker",
              "keywords":   ["Example"] 
            },
            "french_stemmer": {
              "type":       "stemmer",
              "language":   "light_french"
            }
          },
          "analyzer": {
            "rebuilt_french": {
              "tokenizer":  "dot_and_boundary_tokenizer",
              "filter": [
                "french_elision",
                "lowercase",
                "french_stop",
                "french_keywords",
                "french_stemmer"
              ]
            }
          }
        }
      }
    }
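
    You then need to reference the new analyzer name ("rebuilt_french") from your mapping (alternatively you can keep the name "french", since a custom analyzer defined in the index settings takes precedence over the built-in one of the same name for that index). A sketch, reusing the field layout from your question:

    PUT /french_example/_mapping
    {
      "properties": {
        "french_normalised_content": {
          "type": "text",
          "term_vector": "with_positions_offsets",
          "fields": {
            "french_stemmed": {
              "type": "text",
              "analyzer": "rebuilt_french",
              "term_vector": "with_positions_offsets"
            }
          }
        }
      }
    }

    A quick check with the _analyze API (sample text is arbitrary) should now show the dot being split on and the French stop words being dropped:

    GET /french_example/_analyze
    {
      "analyzer": "rebuilt_french",
      "text": "le chat.noir ou le chien"
    }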