Elasticsearch analyzer to remove quoted sentences


I'm trying to create an analyzer that removes a quoted sentence from a document (or replaces it with whitespace).

Such as: this is my \"test document\"

I'd like, for example, the term vector to be: [this, is, my]


Solution

  • Daniel's answer is correct, but since it does not include the regex and replacement, I am providing them here, along with a test against your text.

    The index settings below use a pattern_replace character filter:

    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "standard",
                        "char_filter": [
                            "my_char_filter"
                        ],
                        "filter": [
                            "lowercase"
                        ]
                    }
                },
                "char_filter": {
                    "my_char_filter": {
                        "type": "pattern_replace",
                        "pattern": "\"(.*?)\"",
                        "replacement": ""
                    }
                }
            }
        }
    }
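
    To see what the pattern does, here is a small Python sketch. It is only an approximation: Elasticsearch uses Java regular expressions, not Python's re module, but for a pattern this simple the two should behave the same. The sample sentences and the greedy variant are made up for comparison and are not part of the settings above.

```python
import re

# The pattern from the char filter after JSON unescaping:
# the JSON string "\"(.*?)\"" is the regex "(.*?)"
LAZY = r'"(.*?)"'
# Greedy variant, shown only for contrast (an illustration, not from the answer)
GREEDY = r'"(.*)"'

text = 'say "one" and "two" now'

# Lazy matching stops at the nearest closing quote, so each quoted
# segment is removed separately and the text in between is kept.
lazy_result = re.sub(LAZY, "", text)

# Greedy matching runs to the last quote, so everything between the
# first and last quote is removed, including the unquoted "and".
greedy_result = re.sub(GREEDY, "", text)

# By default "." does not match a newline, so a quotation that spans
# lines is left untouched by this pattern.
multiline_result = re.sub(LAZY, "", 'a "b\nc" d')

print(repr(lazy_result))       # 'say  and  now'
print(repr(greedy_result))     # 'say  now'
print(repr(multiline_result))  # 'a "b\nc" d'
```

    The lazy quantifier is what keeps several quoted segments in one field from swallowing the text between them. Quotations that span line breaks would need a different pattern.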
    

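    With those settings in place, the analyze API produces the tokens below. The request has to target the index that defines my_analyzer (<your-index> is a placeholder), because a custom analyzer is not visible to a bare _analyze call:

    POST <your-index>/_analyze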

    {
        "text": "this is my \"test document\"",
        "analyzer" : "my_analyzer"
    }
    

    Output of the request above:

    {
        "tokens": [
            {
                "token": "this",
                "start_offset": 0,
                "end_offset": 4,
                "type": "<ALPHANUM>",
                "position": 0
            },
            {
                "token": "is",
                "start_offset": 5,
                "end_offset": 7,
                "type": "<ALPHANUM>",
                "position": 1
            },
            {
                "token": "my",
                "start_offset": 8,
                "end_offset": 10,
                "type": "<ALPHANUM>",
                "position": 2
            }
        ]
    }
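
    The same two requests can be sent from a script. Below is a minimal sketch using only the Python standard library. It assumes an unsecured cluster at http://localhost:9200 and a hypothetical index name, quoted-text-demo; neither comes from the question, so adjust both to your setup.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local cluster without auth
INDEX = "quoted-text-demo"          # hypothetical index name


def index_settings():
    """Index settings from the answer, as a Python dict."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "standard",
                        "char_filter": ["my_char_filter"],
                        "filter": ["lowercase"],
                    }
                },
                "char_filter": {
                    "my_char_filter": {
                        "type": "pattern_replace",
                        "pattern": '"(.*?)"',
                        "replacement": "",
                    }
                },
            }
        }
    }


def analyze_body(text):
    """Request body for the index-scoped _analyze call."""
    return {"text": text, "analyzer": "my_analyzer"}


def send(method, path, body):
    """Send a JSON request to the cluster and return the decoded response."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def analyzed_tokens(text):
    """Create the index, then return the token strings for the given text."""
    send("PUT", f"/{INDEX}", index_settings())
    result = send("POST", f"/{INDEX}/_analyze", analyze_body(text))
    return [token["token"] for token in result["tokens"]]
```

    Calling analyzed_tokens('this is my "test document"') against a running cluster should return the three tokens shown in the output above. Creating the index a second time fails, so delete it or skip the PUT on later runs.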