Searching for hyphenated text in Elasticsearch


I am storing a 'Payment Reference Number' in Elasticsearch.

Its format is, for example, 2-4-3-635844569819109531 or 2-4-2-635844533758635433.

I want to be able to search for documents by their payment reference number using either

  1. The whole reference number, e.g. 2-4-2-635844533758635433
  2. Any leading part of the reference number, e.g. 2-4-2-63 (which should return only the second of the two examples above)

Note: I do not want to match in the middle or at the end of the number, only from the beginning.
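
To pin down the two search cases, here is a minimal sketch of the matching behaviour being asked for, written as plain Python rather than Elasticsearch. The list `REFERENCES` and the function `find_by_prefix` are hypothetical names used only for illustration.

```python
# Hypothetical in-memory stand-in for the indexed documents.
REFERENCES = [
    "2-4-3-635844569819109531",
    "2-4-2-635844533758635433",
]


def find_by_prefix(query, references=REFERENCES):
    """Return the references that start with `query`.

    A full reference number is simply the longest possible prefix,
    so both search cases reduce to a starts-with check.
    """
    return [ref for ref in references if ref.startswith(query)]


print(find_by_prefix("2-4-2-635844533758635433"))  # whole number
print(find_by_prefix("2-4-2-63"))                  # leading part
print(find_by_prefix("635844"))                    # middle part: no match wanted
```

Whatever mapping is chosen should behave like this starts-with check on the raw string, hyphens included.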

The hyphens are what confuse me.

Questions

1) I am not sure whether I should remove them with a mapping char filter like

"char_filter" : {
    "removeHyphen" : {
        "type" : "mapping",
        "mappings" : ["-=>"]
    }
},

or not. I have never used mappings in that way, so I am not sure if this is necessary.

2) I think I need an 'ngrams' filter because I want to be able to search by part of the reference number from the beginning. I think something like

"partial_word":{
    "filter":[
        "standard",
        "lowercase",
        "name_ngrams"
    ],
    "type":"custom",
    "tokenizer":"whitespace"
},

and the filter

"name_ngrams":{
    "side":"front",
    "max_gram":50,
    "min_gram":2,
    "type":"edgeNGram"
},

I am not sure how to put it all together, but this is what I have so far:

"paymentReference":{
    "type":"string",
    "analyzer": "??",
    "fields":{
        "partial":{
            "search_analyzer":"???",
            "index_analyzer":"partial_word",
            "type":"string"
        }
    }
}

Everything that I have tried seems to always 'break' in the second search case.

If I do 'localhost:9200/orders/_analyze?field=paymentReference&pretty=1' -d "2-4-2-635844533758635433", it always breaks at the hyphens into separate tokens and returns e.g. all documents with 2-, which is a lot and not what I want when searching for 2-4-2-6.
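
A rough sketch of why this over-matches, in plain Python. It assumes the analyzer ends up cutting the value at each hyphen and that query tokens are combined with OR. The exact tokens depend on the analyzer actually in use, so treat `split_on_hyphens` as a crude, hypothetical stand-in rather than what Elasticsearch does internally.

```python
import re

REFERENCES = [
    "2-4-3-635844569819109531",
    "2-4-2-635844533758635433",
]


def split_on_hyphens(text):
    """Crude stand-in for an analyzer that breaks the value at each hyphen."""
    return [tok for tok in re.split(r"-", text) if tok]


def or_match(query, references=REFERENCES):
    """Return references sharing at least one token with the query (OR semantics)."""
    query_tokens = set(split_on_hyphens(query))
    return [
        ref for ref in references
        if query_tokens & set(split_on_hyphens(ref))
    ]


print(sorted(set(split_on_hyphens("2-4-2-6"))))  # ['2', '4', '6']
print(or_match("2-4-2-6"))  # both references match via the shared "2" and "4"
```

Once the reference is cut into short pieces such as "2" and "4", almost every document shares a piece with the query, which is consistent with the flood of results described above.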

Can someone tell me how to map this field for the two types of searches I am trying to achieve?

Update - Answer

Effectively what Val said below. I just changed the mapping slightly to be more specific about the analyzers, and I don't need the main string analyzed because I only query the partial field.

Mapping

"paymentReference":{
    "type": "string",
    "index":"not_analyzed",
    "fields": {
        "partial": {
            "search_analyzer":"payment_ref",
            "index_analyzer":"payment_ref",
            "type":"string"
        }
    }
}

Analyzer

"payment_ref": {
    "type": "custom",
    "filter": [
        "lowercase",
        "name_ngrams"
    ],
    "tokenizer": "keyword"
}

Filter

"name_ngrams":{
    "side":"front",
    "max_gram":50,
    "min_gram":2,
    "type":"edgeNGram"
},

Solution

  • You don't need to use the mapping char filter for this.

    You're on the right track using the Edge NGram token filter, since you only need to search for prefixes. I would use a keyword tokenizer instead of the whitespace one, to make sure the whole reference is treated as a single term before the n-grams are generated. So the way to set this up is like this:

    curl -XPUT localhost:9200/orders -d '{
      "settings": {
        "analysis": {
          "analyzer": {
            "partial_word": {
              "type": "custom",
              "filter": [
                "lowercase",
                "ngram_filter"
              ],
              "tokenizer": "keyword"
            }
          },
          "filter": {
            "ngram_filter": {
              "type": "edgeNGram",
              "min_gram": 2,
              "max_gram": 50
            }
          }
        }
      },
      "mappings": {
        "order": {
          "properties": {
            "paymentReference": {
              "type": "string",
              "fields": {
                "partial": {
                  "analyzer": "partial_word",
                  "type": "string"
                }
              }
            }
          }
        }
      }
    }'
    

    Then you can analyze what is going to be indexed into your paymentReference.partial field:

    curl -XGET 'localhost:9200/orders/_analyze?field=paymentReference.partial&pretty=1' -d "2-4-2-635844533758635433"
    

    And you get exactly what you want, i.e. all the prefixes:

    {
      "tokens" : [ {
        "token" : "2-",
        "start_offset" : 0,
        "end_offset" : 24,
        "type" : "word",
        "position" : 1
      }, {
        "token" : "2-4",
        "start_offset" : 0,
        "end_offset" : 24,
        "type" : "word",
        "position" : 1
      }, {
        "token" : "2-4-",
        "start_offset" : 0,
        "end_offset" : 24,
        "type" : "word",
        "position" : 1
      }, {
        "token" : "2-4-2",
        "start_offset" : 0,
        "end_offset" : 24,
        "type" : "word",
        "position" : 1
      }, {
        "token" : "2-4-2-",
        "start_offset" : 0,
        "end_offset" : 24,
        "type" : "word",
        "position" : 1
      }, {
        "token" : "2-4-2-6",
        "start_offset" : 0,
        "end_offset" : 24,
        "type" : "word",
        "position" : 1
      }, {
        "token" : "2-4-2-63",
        "start_offset" : 0,
        "end_offset" : 24,
        "type" : "word",
        "position" : 1
      }, {
        "token" : "2-4-2-635",
        "start_offset" : 0,
        "end_offset" : 24,
        "type" : "word",
        "position" : 1
      }, {
        "token" : "2-4-2-6358",
        "start_offset" : 0,
        "end_offset" : 24,
        "type" : "word",
        "position" : 1
      }, {
        "token" : "2-4-2-63584",
        "start_offset" : 0,
        "end_offset" : 24,
        "type" : "word",
        "position" : 1
      }, {
      ...
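
The token list above can be reproduced offline with a small Python sketch. It assumes the analyzer behaves as configured in the settings: the keyword tokenizer keeps the whole value as one term, the term is lowercased, and then front edge n-grams of length 2 to 50 are emitted. The function name `edge_ngrams` is hypothetical.

```python
def edge_ngrams(term, min_gram=2, max_gram=50):
    """Approximate keyword tokenizer + lowercase + front edge n-gram filter.

    The whole input is one term; every prefix whose length lies between
    min_gram and max_gram (capped at the term length) becomes a token.
    """
    term = term.lower()
    longest = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, longest + 1)]


tokens = edge_ngrams("2-4-2-635844533758635433")
print(tokens[:5])   # ['2-', '2-4', '2-4-', '2-4-2', '2-4-2-']
print(len(tokens))  # 23 prefixes, lengths 2 through 24
print(tokens[-1])   # the full reference number is the last token
```

Because the full reference is itself one of the tokens, the same field covers both the whole-number search and the leading-part search.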
    

    Finally you can search for any prefix:

    curl -XGET 'localhost:9200/orders/order/_search?q=paymentReference.partial:2-4-3'
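
One thing worth checking on your own cluster is how the query text itself is analyzed. The sketch below is a plain-Python simulation, not Elasticsearch code. It assumes that the query is run through the same edge n-gram analyzer as the indexed value and that the resulting terms are combined with OR. Under those assumptions a query for 2-4-3 also matches references that only share a shorter prefix such as 2-4-. The names `edge_ngrams`, `INDEX` and `search` are hypothetical.

```python
def edge_ngrams(term, min_gram=2, max_gram=50):
    """Approximate keyword tokenizer + lowercase + front edge n-gram filter."""
    term = term.lower()
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]


REFERENCES = [
    "2-4-3-635844569819109531",
    "2-4-2-635844533758635433",
]

# Simulated index: each reference maps to the set of prefixes stored for it.
INDEX = {ref: set(edge_ngrams(ref)) for ref in REFERENCES}


def search(query_terms):
    """Return references whose stored tokens contain ANY of the query terms."""
    wanted = set(query_terms)
    return [ref for ref, tokens in INDEX.items() if tokens & wanted]


query = "2-4-3"
# Query analyzed with the n-gram analyzer: "2-", "2-4", "2-4-", "2-4-3".
print(search(edge_ngrams(query)))  # both references match
# Query kept as one lowercased term: only the true prefix match remains.
print(search([query.lower()]))     # ['2-4-3-635844569819109531']
```

If you see the first behaviour in practice, one option to try is a search-time analyzer that keeps the query whole (keyword tokenizer plus lowercase) while leaving the edge n-gram analyzer on the index side. Running _analyze on a sample query is a quick way to confirm which case applies.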