
Elastic Search - search the data ignoring periods or


My Elasticsearch index holds documents that contain CPF numbers:

{
  "name": "A",
  "cpf": "718.881.683-23"
}

{
  "name": "B",
  "cpf": "404.833.187-60"
}

I want to search these documents by the cpf field as follows:

query: 718
output: doc with name "A"
query: 718.881.683-23
output: doc with name "A"

The two queries above work.

But the following one does not, although I expect it to return the same document:

query: 71888168323
output: doc with name "A"

In other words, I also want to find the document when the CPF is typed without the periods and the hyphen.


Solution

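  • You can add a custom analyzer that strips every non-digit character, so that only the digits are indexed.

    The analyzer looks like this. The keyword tokenizer keeps the whole value as a single token, and the pattern_replace filter (whose replacement defaults to an empty string) deletes every non-digit character: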

    PUT test
    {
      "settings": {
        "analysis": {
          "filter": {
            "number_only": {
              "type": "pattern_replace",
              "pattern": "\\D"
            }
          },
          "analyzer": {
            "cpf_analyzer": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": [
                "number_only"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "cpf": {
            "type": "text",
            "analyzer": "cpf_analyzer"
          }
        }
      }
    }
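
To make the effect of this analyzer concrete, its two steps can be modeled in a few lines of Python. This is only a sketch of the behavior, not Elasticsearch code, and the function name `cpf_analyze` is made up for illustration. On a real cluster, the `_analyze` API reports the tokens that are actually produced.

```python
import re

def cpf_analyze(text: str) -> list:
    """Hypothetical model of cpf_analyzer; not an Elasticsearch API."""
    # keyword tokenizer: the whole input becomes one single token
    tokens = [text]
    # pattern_replace filter with pattern \D and an empty replacement:
    # every non-digit character is deleted from each token.
    # re.ASCII restricts \D to "anything except 0-9".
    return [re.sub(r"\D", "", token, flags=re.ASCII) for token in tokens]

print(cpf_analyze("718.881.683-23"))  # ['71888168323']
print(cpf_analyze("71888168323"))     # ['71888168323']
```

Both spellings end up as the same token, which is why the formatted and the digits-only form can find the same document.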
    

    Then you can index your documents as usual:

    POST test/_doc
    {
      "name": "A",
      "cpf": "718.881.683-23"
    }
    
    POST test/_doc
    {
      "name": "B",
      "cpf": "404.833.187-60"
    }
    

    Searching for a prefix like 718 can be done like this:

    POST test/_search
    {
      "query": {
        "prefix": {
          "cpf": "718"
        }
      }
    }
    

    Searching for the exact value with non-digit characters can be done like this:

    POST test/_search
    {
      "query": {
        "match": {
          "cpf": "718.881.683-23"
        }
      }
    }
    

    And finally, you can also search with numbers only:

    POST test/_search
    {
      "query": {
        "match": {
          "cpf": "71888168323"
        }
      }
    }
    

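    With the given analyzer, all three queries above return the document named A, because a match query runs its input through the same analyzer before comparing it with the indexed value.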
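
Why the three queries behave the same can be illustrated with a small in-memory model. It is a sketch under two assumptions: each document is stored as one digits-only term, and a prefix query, being a term-level query, uses its text as typed while a match query analyzes it first. The names `INDEX`, `match` and `prefix` are illustrative only.

```python
import re

def digits_only(text: str) -> str:
    # Stand-in for the analyzer: keep ASCII digits, drop everything else.
    return re.sub(r"\D", "", text, flags=re.ASCII)

# One indexed term per document, as the analyzer would store it.
INDEX = {
    "A": digits_only("718.881.683-23"),
    "B": digits_only("404.833.187-60"),
}

def match(query: str) -> list:
    # Full-text query: the query text is analyzed, then compared.
    term = digits_only(query)
    return [name for name, indexed in INDEX.items() if indexed == term]

def prefix(query: str) -> list:
    # Term-level query: the text is used as typed, so it must be a
    # prefix of the stored digits-only term.
    return [name for name, indexed in INDEX.items() if indexed.startswith(query)]

print(prefix("718"))             # ['A']
print(match("718.881.683-23"))   # ['A']
print(match("71888168323"))      # ['A']
print(prefix("718.8"))           # []
```

The last call shows a limit of this model: a prefix typed with punctuation does not match, so the non-digits would have to be removed before sending a prefix query.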

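    If you cannot recreate your index for whatever reason, you can create a sub-field with the right analyzer and update your data in place. The cpf_analyzer must already exist in the index settings for this to work; adding an analyzer to an existing index requires closing the index, updating the analysis settings, and reopening it.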

    PUT test/_mapping
    {
      "properties": {
        "cpf": {
          "type": "text",
          "fields": {
            "numeric": {
              "type": "text",
              "analyzer": "cpf_analyzer"
            }
          }
        }
      }
    }
    

    Then run the following command, which reindexes all documents in place and populates the cpf.numeric field:

    POST test/_update_by_query
    

    All your searches must then target the cpf.numeric field instead of cpf.
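
On the client side, switching to the sub-field only changes the field name in the request body. The helper below is a hypothetical sketch (the name `cpf_query` and its parameters are not part of any library) that builds such a body. It assumes prefix queries are not analyzed, so it removes non-digits itself in that case.

```python
import json
import re

def cpf_query(value: str, field: str = "cpf.numeric", prefix: bool = False) -> dict:
    """Hypothetical helper: build a search body for a CPF lookup."""
    if prefix:
        # Assumption: prefix queries are not analyzed, so strip
        # punctuation here before sending the text.
        digits = re.sub(r"\D", "", value, flags=re.ASCII)
        return {"query": {"prefix": {field: digits}}}
    # match queries are analyzed by the field's analyzer on the server,
    # so the value can be passed through unchanged.
    return {"query": {"match": {field: value}}}

print(json.dumps(cpf_query("718.881.683-23"), indent=2))
print(json.dumps(cpf_query("718.8", prefix=True), indent=2))
```

The resulting JSON can be sent as the body of POST test/_search with any HTTP client.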