elasticsearch, lucene, n-gram

Elasticsearch index for n-grams?


Say I have the sentence "This is a new city".

  1. Does Elasticsearch create index entries for all possible substrings of a word? For example, for the word "city", will it index "it", "ty", "ity", "cit", etc.?
  2. Are these index entries created when the document is stored or at query time?
  3. Are these indexes kept in memory or on disk?

Solution

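    1. That depends on your analyzer. By default Elasticsearch uses the standard analyzer, whose standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm, and then lowercases them. That means your sentence will be indexed as the terms this, is, a, new, city; no substrings such as "it" or "ity" are generated. If you want those, you can define a custom analyzer, for example one that uses an n-gram tokenizer.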

    2. The terms are created at index time, when you send the documents to Elasticsearch.

    3. The index data is stored on the file system: https://www.elastic.co/blog/found-dive-into-elasticsearch-storage

    Here is a blog post about internals: https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
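
To make the difference between word tokens and n-grams concrete, here is a minimal Python sketch. It does not call Elasticsearch; `word_tokens` is only a rough stand-in for the default analysis (a regular expression, not the real Unicode Text Segmentation algorithm), and `ngrams` imitates what an n-gram tokenizer would emit for a single word. The names `word_tokens`, `ngrams`, `NGRAM_INDEX_SETTINGS`, `ngram_2_3` and `ngram_analyzer` are made up for this illustration, and the `min_gram`/`max_gram` values are assumptions chosen to match the "city" example.

```python
import re


def word_tokens(text):
    """Rough approximation of default analysis: split into words, lowercase."""
    return re.findall(r"\w+", text.lower())


def ngrams(token, min_gram, max_gram):
    """Return every substring of length min_gram..max_gram, ordered by start position."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


# Hypothetical index settings that would enable n-gram indexing.
# The analyzer and tokenizer names are illustrative; "ngram", "min_gram",
# "max_gram" and "token_chars" are the options of the built-in n-gram tokenizer.
NGRAM_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "ngram_2_3": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 3,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "ngram_2_3",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}


if __name__ == "__main__":
    print(word_tokens("This is a new city"))  # ['this', 'is', 'a', 'new', 'city']
    print(ngrams("city", 2, 3))               # ['ci', 'cit', 'it', 'ity', 'ty']
```

A dictionary like `NGRAM_INDEX_SETTINGS` would be sent as the body of the index-creation request, and a text field would then have to reference `ngram_analyzer` in its mapping. Only with such a configuration would terms like "it" or "ity" end up in the index, and they would still be produced at index time, like any other term.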