elasticsearch, elasticsearch-5, elasticsearch-plugin

Removing special characters and words from a URL in Elasticsearch


I am looking for a way to generate both the words and the special characters of a URL as tokens.

e.g. I have the URL https://www.google.com/

I want Elasticsearch to generate the tokens https, www, google, com, :, /, /, ., ., /


Solution

  • You can define a custom analyzer with the letter tokenizer, as shown below. The letter tokenizer splits the text at every non-letter character and drops those characters, which is why the sample output contains only the word parts of the URL; a sketch of wiring the analyzer into a field mapping follows the output.

    PUT index3
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_email": {
              "tokenizer": "letter",
              "filter": [
                "lowercase"       
              ]
            }
          }
        }
      }
    }
    

    Test API:

    POST index3/_analyze
    {
      "text": [
        "https://www.google.com/"
      ],
      "analyzer": "my_email"
    }
    

    Output:

    {
      "tokens" : [
        {
          "token" : "https",
          "start_offset" : 0,
          "end_offset" : 5,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "www",
          "start_offset" : 8,
          "end_offset" : 11,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "google",
          "start_offset" : 12,
          "end_offset" : 18,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "com",
          "start_offset" : 19,
          "end_offset" : 22,
          "type" : "word",
          "position" : 3
        }
      ]
    }
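
    Applying the analyzer to a field:

    The _analyze call above only tests the tokenization. To use the analyzer at index and search time it also has to be referenced from a field mapping. Below is a minimal sketch, assuming a hypothetical text field named url on the same index3 index and the typeless mapping syntax of Elasticsearch 7+ (on 5.x the properties would be nested under a mapping type):

    PUT index3/_mapping
    {
      "properties": {
        "url": {
          "type": "text",
          "analyzer": "my_email"
        }
      }
    }

    PUT index3/_doc/1
    {
      "url": "https://www.google.com/"
    }

    GET index3/_search
    {
      "query": {
        "match": {
          "url": "google"
        }
      }
    }

    Since no separate search_analyzer is set, the match query text is analyzed with the same my_email analyzer, so searching for google (or https, www, com) returns the document indexed above.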