Tags: elasticsearch, aggregation, mappings

Elasticsearch: index first char of string


I'm using version 5.3.

I have a text field a. I'd like to aggregate on the first char of a. I also need the entire original value.

I'm assuming the most efficient way is to have a keyword field a.firstLetter with a custom normalizer. I've tried to achieve this with a pattern replace char filter but am struggling with the regexp.

Am I going about this entirely wrong? Can you help me?

EDIT

This is what I've tried.

settings.json

{
  "settings":  {
    "index": {
      "analysis": {
        "char_filter": {
          "first_char": {
            "type": "pattern_replace",
            "pattern": "(?<=^.)(.*)",
            "replacement": ""
          }
        },
        "normalizer": {
          "first_letter": {
            "type": "custom",
            "char_filter": ["first_char"],
            "filter": ["lowercase"]
          }
        }
      }
    }
  }
}

mappings.json

{
  "properties": {
    "a": {
      "type": "text",
      "index_options": "positions",
      "fields": {
        "firstLetter": {
          "type": "keyword",
          "normalizer": "first_letter"
        }
      }
    }
  }
}

I get no buckets when I try to aggregate like so:

"aggregations": {
  "grouping": {
    "terms": {
      "field": "a.firstLetter"
    }
  }
}

So basically my approach was "replace all but the first char with an empty string." The regexp is something I was able to gather by googling.

EDIT 2: I had misconfigured the normalizer (I've fixed the examples above). With the correct configuration, it turns out that normalizers do not support pattern replace char filters, due to issue 23142. Apparently support will arrive in version 5.4 at the earliest.

So are there any other options? I'd hate to do this in code, by adding a field in the doc for the first letter, since I'm using Elasticsearch features for every other aggregation.


Solution

  • You can use the truncate token filter with a length of 1, in a custom analyzer built on the keyword tokenizer:

    PUT foo
    {
      "mappings": {
        "bar" : {
          "properties": {
            "name" : {
              "type": "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }, 
      "settings":  {
        "index": {
          "analysis": {
            "analyzer" : {
              "my_analyzer" : {
                "type" : "custom",
                "tokenizer" : "keyword",
                "filter" : [ "my_filter", "lowercase" ]
              }
            },
            "filter": {
              "my_filter": {
                "type": "truncate",
                "length": 1
              }
            }
          }
        }
      }
    }
    
    GET foo/_analyze
    {
      "field" : "name",
      "text" : "New York"
    }
    
    # response
    {
      "tokens": [
        {
          "token": "n",
          "start_offset": 0,
          "end_offset": 8,
          "type": "word",
          "position": 0
        }
      ]
    }
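
To connect this back to the question, here is a rough sketch of how the same truncate-based analyzer might be attached to the a.firstLetter sub-field and then aggregated on. It reuses the index and type names foo and bar from the example above; the names first_letter and first_char are borrowed from the question purely for illustration. Since a custom analyzer can only be set on a text field, the sub-field is assumed to be text with fielddata enabled, which 5.x requires before a text field can be used in a terms aggregation. Treat it as an untested starting point rather than a verified configuration.

```
PUT foo
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "first_letter": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["first_char", "lowercase"]
          }
        },
        "filter": {
          "first_char": {
            "type": "truncate",
            "length": 1
          }
        }
      }
    }
  },
  "mappings": {
    "bar": {
      "properties": {
        "a": {
          "type": "text",
          "fields": {
            "firstLetter": {
              "type": "text",
              "analyzer": "first_letter",
              "fielddata": true
            }
          }
        }
      }
    }
  }
}

GET foo/_search
{
  "size": 0,
  "aggregations": {
    "grouping": {
      "terms": {
        "field": "a.firstLetter"
      }
    }
  }
}
```

The original value stays available in a and in _source. Fielddata is held on the heap, but because this sub-field only ever holds single-character terms, the cost should stay small.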