Elasticsearch analyzer auto-complete feature for alphanumeric codes


I have alphanumeric codes like Hcc18, HCC23, and I23 that I want to store in Elasticsearch. On top of this I want to build the following two features:

  1. The user can search by the complete alphanumeric code or by just the integer part.
    Example: for hcc15 or 15, hcc15 should be in the output and at the top of the results.
  2. Autocomplete: when the user types, say, I42, the results should contain I420, I421 and so on.

My current Elasticsearch mapping is:

"mappings": {
  "properties": {
    "code": {
      "type": "text",
      "analyzer": "autoanalyer"
    }
  }
},
"settings": {
  "analysis": {
    "analyzer": {
      "autoanalyer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase"
        ]
      }
    },
    "tokenizer": {
      "autotoken": {
        "type": "simple_pattern",
        "pattern": "[0-9]+"
      }
    }
  }
}

Query being made:

{
    "min_score": 0.1,
    "from": 0,
    "size": 10000,
    "query": {
        "bool": {
            "should": [{ "match": {"code": search_term}}]
        }
    }
}

I am facing two problems with this approach:

  1. Say I search for I420. Because the mapping is based only on digits, I get all the codes related to the number 420, but the exact match I420 doesn't come out on top.

  2. With this mapping, how can I achieve the autocomplete feature described above?


Solution

  • You have multiple requirements, and all of them can be achieved by

    1. Creating a custom analyzer that tokenizes the data according to our requirements.
    2. Using a bool query that combines a prefix query (for autocomplete) with a match query (for the number search).

    Below is a step-by-step example, using the OP's data and queries.

    Index definition

    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "autotoken" --> uses your tokenizer to extract the numbers
                    }
                },
                "tokenizer": {
                    "autotoken": {
                        "type": "simple_pattern",
                        "pattern": "[0-9]+",
                        "preserve_original": true
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "code": {
                    "type": "keyword",
                    "fields": {
                        "number": {
                            "type": "text",
                            "analyzer" : "my_analyzer"
                        }
                    }
                }
            }
        }
    }
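
To see what ends up in the `code.number` sub-field, here is a minimal Python sketch that mimics the `autotoken` tokenizer with the same `[0-9]+` pattern. It is only an approximation: Python's `re` module is not the regex engine Elasticsearch uses, though for this simple pattern the two should agree. The helper name `number_tokens` is hypothetical and not part of any Elasticsearch API.

```python
import re

# Same pattern as the "autotoken" tokenizer in the index definition.
DIGIT_RUN = re.compile(r"[0-9]+")


def number_tokens(code):
    """Return every run of digits found in `code` (hypothetical helper)."""
    return DIGIT_RUN.findall(code)


if __name__ == "__main__":
    # Codes taken from the sample docs in this answer.
    for code in ["hcc420", "HCC23", "I23", "I420", "I421"]:
        print(code, "->", number_tokens(code))
```

The letters are dropped entirely, which is why the original `code` field is kept as a `keyword` alongside the `number` sub-field.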
    

    Index a few docs

    {
      "code" : "hcc420"
    }
    
    {
      "code" : "HCC23"
    }
    
    {
      "code" : "I23"
    }
    
    {
      "code" : "I420"
    }
    
    {
      "code" : "I421"
    }
    
    {
      "code" : "hcc420"
    }
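
If you would rather load the sample docs in one request, a rough sketch of building a bulk request body in Python follows. It assumes the index is called `so_number` (the name shown in the results of this answer) and that the body is sent to `POST /so_number/_bulk` with the `Content-Type: application/x-ndjson` header. The helper `bulk_body` is hypothetical, and it lets Elasticsearch assign the `_id` values, so they will not match the ids shown in the results here.

```python
import json


def bulk_body(codes):
    """Build a newline-delimited bulk body that indexes one doc per code."""
    lines = []
    for code in codes:
        # Action line: an empty "index" action lets Elasticsearch pick the _id.
        lines.append(json.dumps({"index": {}}))
        # Source line: the document itself.
        lines.append(json.dumps({"code": code}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bulk_body(["hcc420", "HCC23", "I23", "I420", "I421"]), end="")
```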
    

    Search query (issue 1: searching for I420 should bring back 2 docs from the sample data, I420 and hcc420, but I420 must score higher as the exact match)

    {
        "query": {
            "bool": {
                "should": [
                    {
                        "prefix": {
                            "code": {
                                "value": "I420"
                            }
                        }
                    },
                    {
                        "match": {
                            "code.number": "I420"
                        }
                    }
                ]
            }
        }
    }
    

    Result

    "hits": [
          {
            "_index": "so_number",
            "_type": "_doc",
            "_id": "4",
            "_score": 2.0296195, --> note the exact match has the higher score
            "_source": {
              "code": "I420"
            }
          },
          {
            "_index": "so_number",
            "_type": "_doc",
            "_id": "7",
            "_score": 1.0296195,
            "_source": {
              "code": "hcc420"
            }
          }
        ]
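
Why does I420 come out on top? In a bool query, a document's score adds up the scores of the `should` clauses it matches, and I420 matches both clauses while hcc420 matches only one. The Python sketch below is a simplified simulation of that logic, not Elasticsearch itself: it only reports which clauses match and does not compute real scores. It assumes the `match` clause analyzes the search term with the same digits-only pattern, and the helper names are hypothetical.

```python
import re

# Unique codes from the sample docs in this answer.
DOCS = ["hcc420", "HCC23", "I23", "I420", "I421"]


def number_tokens(text):
    """Digits-only tokens, mimicking the [0-9]+ tokenizer."""
    return re.findall(r"[0-9]+", text)


def matched_clauses(term, code):
    """Return the names of the should-clauses that `code` satisfies for `term`."""
    clauses = []
    # prefix on the keyword field: the stored value is compared as-is (case-sensitive).
    if code.startswith(term):
        clauses.append("prefix")
    # match on code.number: any shared digit token counts as a match.
    if set(number_tokens(term)) & set(number_tokens(code)):
        clauses.append("match")
    return clauses


def search(term):
    """Map each matching code to the clauses it satisfied."""
    hits = {code: matched_clauses(term, code) for code in DOCS}
    return {code: clauses for code, clauses in hits.items() if clauses}


if __name__ == "__main__":
    for term in ["I420", "I42", "420"]:
        print(term, "->", search(term))
```

This appears consistent with the scores in the result above: 2.0296195 for I420 (both clauses) versus 1.0296195 for hcc420 (only the `match` clause).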
    

    Part 2: The same search query can be used for the autocomplete feature

    So searching for I42 must bring back I420 and I421 from the sample docs

    {
        "query": {
            "bool": {
                "should": [
                    {
                        "prefix": {
                            "code": {
                                "value": "I42"
                            }
                        }
                    },
                    {
                        "match": {
                            "code.number": "I42"
                        }
                    }
                ]
            }
        }
    }
    

    Result

     "hits": [
          {
            "_index": "so_number",
            "_type": "_doc",
            "_id": "4",
            "_score": 1.0,
            "_source": {
              "code": "I420"
            }
          },
          {
            "_index": "so_number",
            "_type": "_doc",
            "_id": "5",
            "_score": 1.0,
            "_source": {
              "code": "I421"
            }
          }
        ]
    

    Let's take another example, this time a number search: searching for 420 must bring back hcc420 and I420

    Search query

     {
            "query": {
                "bool": {
                    "should": [
                        {
                            "prefix": {
                                "code": {
                                    "value": "420"
                                }
                            }
                        },
                        {
                            "match": {
                                "code.number": "420"
                            }
                        }
                    ]
                }
            }
        }
    
    And whoa, again it gave the expected results 😀
    
    Result
    
     "hits": [
          {
            "_index": "so_number",
            "_type": "_doc",
            "_id": "4",
            "_score": 1.0296195,
            "_source": {
              "code": "I420"
            }
          },
          {
            "_index": "so_number",
            "_type": "_doc",
            "_id": "7",
            "_score": 1.0296195,
            "_source": {
              "code": "hcc420"
            }
          }
        ]
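
All three queries in this answer have the same shape and differ only in the search term, so in application code the body can be generated from the user's input. A minimal Python sketch, assuming the field names `code` and `code.number` from the index definition in this answer (the function name `build_code_query` is hypothetical):

```python
import json


def build_code_query(term):
    """Build the bool/should query used in this answer for a given search term."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Autocomplete / exact code: prefix on the keyword field.
                    {"prefix": {"code": {"value": term}}},
                    # Number search: match on the digits-only sub-field.
                    {"match": {"code.number": term}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_code_query("I42"), indent=4))
```

Note that the `prefix` clause runs against a `keyword` field, so it compares the term with the stored value as-is; if users may type `i42` for `I42`, you would likely need to normalize case at index and query time.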