Search code examples
elasticsearchelasticsearch-mapping

Fields not getting sorted in alphabetical order in elasticsearch


I have a few documents with the a name field in it. I am using analyzed version of the name field for search and not_analyzed for sorting purposes. The sorting happens in one level, that is the names are sorted alphabetically at first. But within the list of an alphabet, the names are getting sorted lexicographically rather than alphabetically. Here is the mapping I have used:

{
  "mappings": {
    "seing": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }

Can anyone provide a solution for the same?


Solution

  • Digging down into Elasticsearch documents, I stumbled upon this:

    Case-Insensitive Sorting

    Imagine that we have three user documents whose name fields contain Boffey, BROWN, and bailey, respectively. First we will apply the technique described in String Sorting and Multifields of using a not_analyzed field for sorting:

    PUT /my_index
    {
      "mappings": {
        "user": {
          "properties": {
            "name": {                    //1
              "type": "string",
              "fields": {
                "raw": {                 //2
                  "type":  "string",
                  "index": "not_analyzed"
                }
              }
            }
          }
        }
      }
    }
    
    1. The analyzed name field is used for search.
    2. The not_analyzed name.raw field is used for sorting.

    The preceding search request would return the documents in this order: BROWN, Boffey, bailey. This is known as lexicographical order as opposed to alphabetical order. Essentially, the bytes used to represent capital letters have a lower value than the bytes used to represent lowercase letters, and so the names are sorted with the lowest bytes first.

    That may make sense to a computer, but doesn’t make much sense to human beings who would reasonably expect these names to be sorted alphabetically, regardless of case. To achieve this, we need to index each name in a way that the byte ordering corresponds to the sort order that we want.

    In other words, we need an analyzer that will emit a single lowercase token:

    Following this logic, instead of storing raw document, you need to lowercase it using custom keyword analyzer:

    PUT /my_index
    {
      "settings" : {
        "analysis" : {
          "analyzer" : {
            "case_insensitive_sort" : {
              "tokenizer" : "keyword",
              "filter" : ["lowercase"]
            }
          }
        }
      },
      "mappings" : {
        "seing" : {
          "properties" : {
            "name" : {
              "type" : "string",
              "fields" : {
                "raw" : {
                  "type" : "string",
                  "analyzer" : "case_insensitive_sort"
                }
              }
            }
          }
        }
      }
    }
    

    Now ordering by name.raw should sort in alphabetical order, rather than lexicographical.

    Quick test done on my local machine using Marvel:

    Index structure:

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "case_insensitive_sort": {
              "tokenizer": "keyword",
              "filter": [
                "lowercase"
              ]
            }
          }
        }
      },
      "mappings": {
        "user": {
          "properties": {
            "name": {
              "type": "string",
              "fields": {
                "raw": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "keyword": {
                  "type": "string",
                  "analyzer": "case_insensitive_sort"
                }
              }
            }
          }
        }
      }
    }
    

    Test data:

    PUT /my_index/user/1
    {
      "name": "Tim"
    }
    
    PUT /my_index/user/2
    {
      "name": "TOM"
    }
    

    Query using raw field:

    POST /my_index/user/_search
    {
      "sort": "name.raw"
    }
    

    Result:

    {
      "_index" : "my_index",
      "_type" : "user",
      "_id" : "2",
      "_score" : null,
      "_source" : {
        "name" : "TOM"
      },
      "sort" : [
        "TOM"
      ]
    },
    {
      "_index" : "my_index",
      "_type" : "user",
      "_id" : "1",
      "_score" : null,
      "_source" : {
        "name" : "Tim"
      },
      "sort" : [
        "Tim"
      ]
    }
    

    Query using lowercased string:

    POST /my_index/user/_search
    {
      "sort": "name.keyword"
    }
    

    Result:

    {
      "_index" : "my_index",
      "_type" : "user",
      "_id" : "1",
      "_score" : null,
      "_source" : {
        "name" : "Tim"
      },
      "sort" : [
        "tim"
      ]
    },
    {
      "_index" : "my_index",
      "_type" : "user",
      "_id" : "2",
      "_score" : null,
      "_source" : {
        "name" : "TOM"
      },
      "sort" : [
        "tom"
      ]
    }
    

    I'm suspecting that second result is correct in your case.