Custom analyzer, use case : zip-code [ElasticSearch]


Suppose there is an index/type pair named customers/customer. Each document in it has a zip-code property. A zip-code can look like:

  • String-String (e.g. 8907-1009)
  • String String (e.g. 211 20)
  • String (e.g. 30200)

I'd like to configure my index analyzer so that a search matches as many of these variants as possible. Currently, my mapping looks like this:

PUT /customers/
{
  "mappings": {
    "customer": {
      "properties": {
        "zip-code": {
          "type": "string",
          "index": "not_analyzed"
        }
        some string properties ...
      }
    }
  }
}

When I search for a document, I use this request:

GET /customers/customer/_search
{
  "query":{
    "prefix":{
      "zip-code":"211-20"
     }
   }
}

That works if you want to search strictly. But if, for instance, the stored zip-code is "200 30", then searching for "200-30" will not return any results. I'd like to configure my index analyzer so that I don't have this problem. Can someone help me? Thanks.

P.S. If you want more information, please let me know ;)


Solution

  • As soon as you want to find variations you don't want to use not_analyzed.
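
    To see why, it helps to picture what not_analyzed does: the whole value is kept as one exact term, and a prefix query compares raw characters against it. A minimal Python sketch of that comparison, assuming exactly this behaviour (plain string logic, not Elasticsearch code; prefix_hits is a made-up helper):

```python
def prefix_hits(stored_terms, prefix):
    """Return the stored terms that start with the given prefix, character by character."""
    return [term for term in stored_terms if term.startswith(prefix)]


# Each not_analyzed value is kept as a single, unmodified term.
stored = ["200 30", "211-20", "30200"]

print(prefix_hits(stored, "211"))     # ['211-20']
print(prefix_hits(stored, "200-30"))  # [] because "200 30" contains a space, not a dash
```

    The space and the dash are just different characters here, so no prefix can bridge them.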

    Let's try this with a different mapping:

    PUT zip
    {
      "settings": {
        "number_of_shards": 1, 
        "analysis": {
          "analyzer": {
            "zip_code": {
              "tokenizer": "standard",
              "filter": [ ]
            }
          }
        }
      },
      "mappings": {
        "_doc": {
          "properties": {
            "zip": {
              "type": "text",
              "analyzer": "zip_code"
            }
          }
        }
      }
    }
    

    We're using the standard tokenizer, so strings are broken up into tokens at whitespace and at most punctuation marks (including dashes). You can see the actual tokens if you run the following request:

    POST zip/_analyze
    {
      "analyzer": "zip_code",
      "text": ["8907-1009", "211-20", "30200"]
    }
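
    If you have no cluster at hand, the effect on zip codes can be approximated in a few lines of Python. This is only a rough stand-in: the real standard tokenizer follows Unicode word-boundary rules, while the sketch simply keeps runs of ASCII letters and digits, which should give the same result for digit groups separated by a space or a dash. The function name is made up for the illustration.

```python
import re


def standard_like_tokens(text):
    """Rough stand-in for the standard tokenizer: keep runs of ASCII letters and digits."""
    return re.findall(r"[0-9A-Za-z]+", text)


for zip_code in ["8907-1009", "211-20", "200 30", "200-30", "30200"]:
    print(zip_code, "->", standard_like_tokens(zip_code))
```

    Note that "200 30" and "200-30" end up as the same two tokens, which is what makes the variants from the question findable.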
    

    Add your examples:

    POST zip/_doc
    {
      "zip": "8907-1009"
    }
    POST zip/_doc
    {
      "zip": "211-20"
    }
    POST zip/_doc
    {
      "zip": "30200"
    }
    

    Now the query seems to work fine:

    GET zip/_search
    {
      "query": {
        "match": {
          "zip": "211-20"
        }
      }
    }
    

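    This will also work if you just search for "211". However, this might be too lenient: by default a match query only requires one of the query tokens to be present, so searching for "211-20" will also find documents like "20", "20-211", "211-10", ...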
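
    The leniency comes from the "any token is enough" behaviour of match. A small Python sketch of that rule, assuming the simplified tokenizer below in place of the real standard tokenizer (helper names and the sample values are illustrative only):

```python
import re


def tokens(text):
    # Simplified stand-in for the standard tokenizer (an assumption, not Lucene's real rules).
    return re.findall(r"[0-9A-Za-z]+", text)


def match_or(query, doc):
    """True if the document shares at least one token with the query."""
    return bool(set(tokens(query)) & set(tokens(doc)))


docs = ["211-20", "20", "20-211", "211-10", "30200"]
print([d for d in docs if match_or("211-20", d)])
# ['211-20', '20', '20-211', '211-10']
```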

    What you probably want is a phrase search where all the tokens in your query need to be in the field and also in the right order:

    GET zip/_search
    {
      "query": {
        "match_phrase": {
          "zip": "211"
        }
      }
    }
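
    The difference to match only shows up once the query has more than one token; with a single token such as "211" a phrase search behaves like a plain match. A hedged Python sketch of the phrase rule (query tokens must appear next to each other and in order), again with a simplified tokenizer and made-up helper names:

```python
import re


def tokens(text):
    # Simplified stand-in for the standard tokenizer (an assumption, not Lucene's real rules).
    return re.findall(r"[0-9A-Za-z]+", text)


def match_phrase(query, doc):
    """True if the query tokens occur in the document as one contiguous, ordered run."""
    q, d = tokens(query), tokens(doc)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


docs = ["211-20", "20", "20-211", "211-10", "30200"]
print([d for d in docs if match_phrase("211-20", d)])  # ['211-20']
print([d for d in docs if match_phrase("211", d)])     # ['211-20', '20-211', '211-10']
```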
    

    Addition:

    If the ZIP codes have a hierarchical meaning (if you have "211-20" you want this to be found when searching for "211", but not when searching for "20"), you can use the path_hierarchy tokenizer.

    So change the mapping to this (delete the existing index first, since PUT zip fails if the index already exists):

    PUT zip
    {
      "settings": {
        "number_of_shards": 1, 
        "analysis": {
          "analyzer": {
            "zip_code": {
              "tokenizer": "zip_tokenizer",
              "filter": [ ]
            }
          },
          "tokenizer": {
            "zip_tokenizer": {
              "type": "path_hierarchy",
              "delimiter": "-"
            }
          }
        }
      },
      "mappings": {
        "_doc": {
          "properties": {
            "zip": {
              "type": "text",
              "analyzer": "zip_code"
            }
          }
        }
      }
    }
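
    To get a feeling for what this tokenizer emits, here is a small Python approximation. It assumes values without a leading delimiter and is not the Lucene implementation; path_hierarchy_tokens is a name invented for the sketch.

```python
def path_hierarchy_tokens(text, delimiter="-"):
    """Emit every prefix of the value that ends at a delimiter, plus the full value."""
    parts = text.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


for zip_code in ["8907-1009", "211-20", "30200"]:
    print(zip_code, "->", path_hierarchy_tokens(zip_code))
```

    The second part of a code ("1009", "20") never becomes a token of its own, only the leading part and the full value do.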
    

    Using the same 3 documents from above you can use the match query now:

    GET zip/_search
    {
      "query": {
        "match": {
          "zip": "1009"
        }
      }
    }
    

    "1009" won't find anything, but "8907" or "8907-1009" will.
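
    The reason is that the query text goes through the same tokenizer as the documents. A rough Python sketch of that lookup, assuming the simplified hierarchy tokenizer below and "one shared token is enough" matching (helper names are hypothetical):

```python
def hierarchy_tokens(text, delimiter="-"):
    """Simplified path_hierarchy: every delimiter-terminated prefix plus the full value."""
    parts = text.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


def matches(query, doc):
    """Query and document use the same tokenizer; one shared token is enough."""
    return bool(set(hierarchy_tokens(query)) & set(hierarchy_tokens(doc)))


docs = ["8907-1009", "211-20", "30200"]
for query in ["1009", "8907", "8907-1009"]:
    print(query, "->", [d for d in docs if matches(query, d)])
```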

    If you want to also find "1009", but with a lower score, you'll have to analyze the zip code with both variations I have shown (combine the 2 versions of the mapping):

    PUT zip
    {
      "settings": {
        "number_of_shards": 1, 
        "analysis": {
          "analyzer": {
            "zip_hierarchical": {
              "tokenizer": "zip_tokenizer",
              "filter": [ ]
            },
            "zip_standard": {
              "tokenizer": "standard",
              "filter": [ ]
            }
          },
          "tokenizer": {
            "zip_tokenizer": {
              "type": "path_hierarchy",
              "delimiter": "-"
            }
          }
        }
      },
      "mappings": {
        "_doc": {
          "properties": {
            "zip": {
              "type": "text",
              "analyzer": "zip_standard",
              "fields": {
                "hierarchical": {
                  "type": "text",
                  "analyzer": "zip_hierarchical"
                }
              }
            }
          }
        }
      }
    }
    

    Add a document with the inverse order to properly test it:

    POST zip/_doc
    {
      "zip": "1009-111"
    }
    

    Then search both fields, but boost the one with the hierarchical tokenizer by 3:

    GET zip/_search
    {
      "query": {
        "multi_match" : {
          "query" : "1009",
          "fields" : [ "zip", "zip.hierarchical^3" ] 
        }
      }
    }
    

    Then you can see that "1009-111" has a much higher score than "8907-1009".
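
    The ranking can be illustrated with a toy model in Python. It is deliberately not Elasticsearch's real relevance scoring: each field contributes its boost if it shares a token with the query, and the best field wins. Tokenizers, helper names and scores are assumptions made for the sketch; only the order of the results is meant to carry over.

```python
import re


def standard_tokens(text):
    # Simplified stand-in for the standard tokenizer.
    return re.findall(r"[0-9A-Za-z]+", text)


def hierarchy_tokens(text, delimiter="-"):
    # Simplified path_hierarchy: every delimiter-terminated prefix plus the full value.
    parts = text.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


# (tokenizer, boost) per field, mirroring "zip" and "zip.hierarchical^3".
FIELDS = {
    "zip": (standard_tokens, 1.0),
    "zip.hierarchical": (hierarchy_tokens, 3.0),
}


def toy_score(query, doc):
    """Boost of the best field that shares a token with the query; 0.0 if none does."""
    best = 0.0
    for tokenize, boost in FIELDS.values():
        if set(tokenize(query)) & set(tokenize(doc)):
            best = max(best, boost)
    return best


for doc in ["1009-111", "8907-1009", "211-20"]:
    print(doc, toy_score("1009", doc))
```

    In this toy model "1009-111" matches in the boosted hierarchical field, "8907-1009" only matches in the plain zip field, and "211-20" does not match at all.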