elasticsearch, elasticsearch-analyzers, elasticsearch-1.7.5

Elasticsearch analyze API returns wrong tokens on version 1.X when called with 7.X syntax


While helping a user with one of his queries, I initially assumed he was using the latest version of Elasticsearch, so the output of his analyze API call came as a surprise.

Custom analyzer whose tokens need to be checked

{
    "settings": {
        "analysis": {
            "filter": {
                "splcharfilter": {
                    "type": "pattern_capture",
                    "preserve_original": true,
                    "patterns": [
                        "([?/])"
                    ]
                }
            },
            "analyzer": {
                "splcharanalyzer": {
                    "tokenizer": "keyword",
                    "filter": [
                        "splcharfilter",
                        "lowercase"
                    ]
                }
            }
        }
    }
}

Analyze API

POST /_analyze

{
    "analyzer": "splcharanalyzer",
    "text" : "And/or"
}

Output

{
    "tokens": [
        {
            "token": "analyzer", --> why this token
            "start_offset": 7,
            "end_offset": 15,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "splcharanalyzer", --> why this token
            "start_offset": 19,
            "end_offset": 34,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "text", --> why this token
            "start_offset": 42,
            "end_offset": 46,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "and",
            "start_offset": 51,
            "end_offset": 54,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "or",
            "start_offset": 58,
            "end_offset": 60,
            "type": "<ALPHANUM>",
            "position": 5
        }
    ]
}

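As the output above clearly shows, the call generates several tokens that are not correct. The offsets run up to 60 even though the text is only six characters long, and every token has the type <ALPHANUM>, which suggests that the whole request body, rather than just the value of "text", was analyzed with the default analyzer. When asked, the user mentioned that he was running version 1.7 but had followed the syntax given for the latest version of Elasticsearch.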
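To illustrate where the stray tokens may come from, here is a minimal Python sketch. It assumes that 1.X treats the raw request body as the text to analyze, and it uses a plain regular expression as a very rough stand-in for the default analyzer. The helper name rough_default_tokens is hypothetical, and the real tokenizer is more involved than this.

```python
import re

# The 7.X-style request body, as it would reach the 1.X node.
BODY = '{\n    "analyzer": "splcharanalyzer",\n    "text" : "And/or"\n}'


def rough_default_tokens(raw):
    """Rough approximation only: split on non-word characters and lowercase.

    This is NOT the real Elasticsearch tokenizer; it merely shows that
    analyzing the JSON text itself yields the key names as tokens.
    """
    return [match.group().lower() for match in re.finditer(r"\w+", raw)]


print(rough_default_tokens(BODY))
```

Under these assumptions the JSON keys and the analyzer name end up as tokens next to "and" and "or", which is the same set of five tokens seen in the output above.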


Solution

  • Elasticsearch 1.X is quite old, and the Elasticsearch documentation opens at the latest version of the API by default. Given how important the analyze API is for troubleshooting so many issues in Elasticsearch, I am posting the correct 1.X syntax here in the hope that it helps other users of old Elasticsearch versions.

    The Elasticsearch 1.X analyze API documentation can be found here, and below are the correct request and the tokens it generates for the text mentioned in the question.

      GET /_analyze?analyzer=splcharanalyzer&text=And/or --> note that it is a GET request
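The difference between the two request styles can be sketched in Python as follows. The function analyze_request and its parameters are hypothetical names chosen for illustration; it covers only the 1.X and 7.X forms shown in this post and says nothing about the versions in between. Percent-encoding the text is a cautious assumption on my part, since the request above sends And/or unencoded, and the optional index argument is there because an analyzer defined in an index's settings is normally addressed through that index.

```python
import json
from urllib.parse import urlencode


def analyze_request(major_version, analyzer, text, index=None):
    """Return (method, path, body) for an _analyze call (illustrative only)."""
    prefix = f"/{index}" if index else ""
    if major_version == 1:
        # 1.X style from this answer: parameters go in the query string.
        query = urlencode({"analyzer": analyzer, "text": text})
        return "GET", f"{prefix}/_analyze?{query}", None
    # 7.X style from the question: parameters go in a JSON body.
    body = json.dumps({"analyzer": analyzer, "text": text})
    return "POST", f"{prefix}/_analyze", body


print(analyze_request(1, "splcharanalyzer", "And/or"))
print(analyze_request(7, "splcharanalyzer", "And/or"))
```

Keeping the version check in one place makes it harder to send a JSON body to a node that will not interpret it as parameters.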
    

    Correct tokens generated on 1.X for And/or with the analyzer posted in the question

    {
        "tokens": [
            {
                "token": "and/or",
                "start_offset": 0,
                "end_offset": 6,
                "type": "word",
                "position": 1
            },
            {
                "token": "/",
                "start_offset": 0,
                "end_offset": 6,
                "type": "word",
                "position": 1
            }
        ]
    }
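The two tokens above can be cross-checked with a simplified Python emulation of the analyzer chain from the question: a keyword tokenizer, then the pattern_capture filter with preserve_original enabled, then lowercase. This is only a sketch under those assumptions. The function names are made up for illustration, positions and offsets are not tracked, and Python's re module stands in for the Java regular expressions that Elasticsearch uses.

```python
import re


def keyword_tokenizer(text):
    # The keyword tokenizer emits the whole input as one token.
    return [text]


def pattern_capture(tokens, patterns, preserve_original=True):
    # Simplified emulation: keep the original token if requested, then
    # emit every non-empty capture group found by each pattern.
    out = []
    for token in tokens:
        if preserve_original:
            out.append(token)
        for pattern in patterns:
            for match in re.finditer(pattern, token):
                for group in match.groups():
                    if group and not (preserve_original and group == token):
                        out.append(group)
    return out


def splcharanalyzer(text):
    tokens = keyword_tokenizer(text)
    tokens = pattern_capture(tokens, [r"([?/])"])
    return [token.lower() for token in tokens]


print(splcharanalyzer("And/or"))  # ['and/or', '/']
```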