elasticsearch elasticsearch-plugin query-analyzer

Creating a custom analyzer with a large char_filter list in Elasticsearch


I am trying to add a custom analyzer to Elasticsearch. My "mappings" list of synonyms (mapper_list) is very large: about 30,000 elements.

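import json
import requests

# Close the index first: analysis settings can only be changed on a closed index.
requests.post(es_host + '/_close')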

settings = {
    "settings" : {
        "analysis" : {
            "char_filter" : {
                "my_mapping" : {
                    "type" : "mapping",
                    "mappings" : mapper_list
                }
            },
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "char_filter" : ["my_mapping"]
                }
            }
        }
    }
}

requests.put(es_host + '/_settings',
             data=json.dumps(settings))

requests.post(es_host + '/_open')

Error message from Elasticsearch:

[test-index] IndexCreationException[failed to create index]; nested: ArrayIndexOutOfBoundsException[256];
    at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:360)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewIndices(IndicesClusterStateService.java:313)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:174)
    at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Any suggestions on how to solve this problem would be appreciated.

Info about ES version:

  "version" : {
    "number" : "2.4.1",
    "build_hash" : "c67dc32e24162035d18d6fe1e952c4cbcbe79d16",
    "build_timestamp" : "2016-09-27T18:57:55Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.2"
  }

Solution

  • I think the error is caused by mapping very long strings. What exactly are you trying to map? If you look at the source code, there is a limit of 256 characters, and you are exceeding it. I get the same exception

    ArrayIndexOutOfBoundsException[256]

    if I try to map large strings.

    {
      "settings": {
        "analysis": {
          "char_filter": {
            "my_mapping": {
              "type": "mapping",
              "mappings": ["More than 256 characters. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. => exception will be thrown"]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "char_filter": [
                "my_mapping"
              ]
            }
          }
        }
      }
    }
    

    I do not know your use case, but if you reduce the length of the strings you are mapping, it should work.
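
With about 30,000 rules, it may help to locate the offending entries before sending the settings. The following is a minimal sketch, not part of the original answer: `find_oversized_mappings` is a hypothetical helper, and it assumes that each rule is a string of the form "key => value" and that the 256-character limit mentioned above applies to each side of the rule separately.

```python
def find_oversized_mappings(mappings, limit=256):
    """Return (index, rule) pairs whose key or value exceeds `limit` characters.

    Assumption: the limit applies to each side of "key => value" separately.
    """
    oversized = []
    for index, rule in enumerate(mappings):
        # Split on the first "=>" only; a rule without "=>" has an empty value.
        key, _, value = rule.partition("=>")
        if len(key.strip()) > limit or len(value.strip()) > limit:
            oversized.append((index, rule))
    return oversized


# Illustrative data only; in the question this would be mapper_list.
example = ["ph => f", "x" * 300 + " => y"]
for index, rule in find_oversized_mappings(example):
    print(index, len(rule))
```

Rules reported by this check could then be shortened or dropped from mapper_list before the settings are sent.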