Exclude from CamelCase tokenizer in Elasticsearch


I'm struggling to make iPhone match when searching for iphone in Elasticsearch.

Since I'm indexing source code, I definitely need a CamelCase tokenizer, but it breaks iPhone into two terms, so a search for iphone finds nothing.

Does anyone know of a way to add exceptions to splitting camelCase words into tokens (camel + case)?

UPDATE: to make it clear, I want NullPointerException to be tokenized as [null, pointer, exception], but I don't want iPhone to become [i, phone].

Any other solution?

UPDATE 2: @ChintanShah's answer suggests a different approach that gives us even more: NullPointerException will be tokenized as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception], which is much more useful from the searcher's point of view. And indexing is also faster! The price to pay is index size, but it is a superior solution.


Solution

  • You can achieve this with the word_delimiter token filter. This is my setup:

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "camel_analyzer": {
              "tokenizer": "whitespace",
              "filter": [
                "camel_filter",
                "lowercase",
                "asciifolding"
              ]
            }
          },
          "filter": {
            "camel_filter": {
              "type": "word_delimiter",
              "generate_number_parts": false,
              "stem_english_possessive": false,
              "split_on_numerics": false,
              "protected_words": [
                "iPhone",
                "WiFi"
              ]
            }
          }
        }
      },
      "mappings": {
      }
    }
    

    This splits words on case changes, so NullPointerException is tokenized as null, pointer and exception, but iPhone and WiFi remain as they are because they are protected. word_delimiter has a lot of options for flexibility. You can also enable preserve_original, which keeps the original token alongside the split parts and will help you a lot.
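
    As a rough sketch of the preserve_original option mentioned here, the camel_filter definition from the settings could be extended as follows. Only the preserve_original line is new; everything else is copied from the setup above. With this in place, NullPointerException should yield the whole word in addition to null, pointer and exception, though it is worth confirming with _analyze on your own version.

    ```json
    {
      "camel_filter": {
        "type": "word_delimiter",
        "generate_number_parts": false,
        "stem_english_possessive": false,
        "split_on_numerics": false,
        "preserve_original": true,
        "protected_words": [
          "iPhone",
          "WiFi"
        ]
      }
    }
    ```

    This fragment replaces the camel_filter entry under "analysis" > "filter"; the analyzer definition itself does not need to change.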

    GET logs_index/_analyze?text=iPhone&analyzer=camel_analyzer
    

    Result

    {
       "tokens": [
          {
             "token": "iphone",
             "start_offset": 0,
             "end_offset": 6,
             "type": "word",
             "position": 1
          }
       ]
    }
    

    Now with

    GET logs_index/_analyze?text=NullPointerException&analyzer=camel_analyzer
    

    Result

    {
       "tokens": [
          {
             "token": "null",
             "start_offset": 0,
             "end_offset": 4,
             "type": "word",
             "position": 1
          },
          {
             "token": "pointer",
             "start_offset": 4,
             "end_offset": 11,
             "type": "word",
             "position": 2
          },
          {
             "token": "exception",
             "start_offset": 11,
             "end_offset": 20,
             "type": "word",
             "position": 3
          }
       ]
    }
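
    Both _analyze calls above pass text and analyzer in the query string. As far as I know, newer Elasticsearch releases expect these in a request body instead, so on a recent cluster the equivalent of the second call would probably be a GET to logs_index/_analyze with a body along these lines (same index and analyzer names as above):

    ```json
    {
      "analyzer": "camel_analyzer",
      "text": "NullPointerException"
    }
    ```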
    

    Another approach is to analyze your field twice with different analyzers, but I feel word_delimiter will do the trick.
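
    If you do want to try analyzing the field twice, multi-fields are one way to do it. The mapping below is only a sketch under some assumptions: the field name message and the sub-field name camel are made up for illustration, and the text type assumes a recent Elasticsearch version (on versions that predate the text type, string would be used instead).

    ```json
    {
      "properties": {
        "message": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "camel": {
              "type": "text",
              "analyzer": "camel_analyzer"
            }
          }
        }
      }
    }
    ```

    A search can then target both message and message.camel, so a query matches if either analysis produces the term.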

    Does this help?