Tags: elasticsearch, string, tokenizer, n-gram

N-grams with frequency counts using Elasticsearch


I used the n-gram tokenizer to create n-grams in Elasticsearch, but I can't retrieve the frequency of each gram, whether bigram or trigram. How can I do that?


Solution

  • It wasn't clear from your question exactly what you were trying to do. It's generally a good idea to post the code you've tried, and as specific a description of your problem as possible.

    At any rate, I think this code will come close to doing what you want:

    http://sense.qbox.io/gist/f357f15360719299ac556e8082afe26e4e0647d1

    I started with the code in this answer, then refined it using the information in the docs for shingle token filters. Here is the mapping I ended up with:

    PUT /test_index
    {
       "settings": {
          "analysis": {
             "analyzer": {
                "evolutionAnalyzer": {
                   "tokenizer": "standard",
                   "filter": [
                      "standard",
                      "lowercase",
                      "custom_shingle"
                   ]
                }
             },
             "filter": {
                "custom_shingle": {
                   "type": "shingle",
                   "min_shingle_size": "2",
                   "max_shingle_size": "3",
                   "filler_token": "",
                   "output_unigrams": true
                }
             }
          }
       },
       "mappings": {
          "doc": {
             "properties": {
                "content": {
                   "type": "string",
                   "index_analyzer": "evolutionAnalyzer",
                   "search_analyzer": "standard",
                   "term_vector": "yes"
                }
             }
          }
       }
    }
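
To get a feel for what the `custom_shingle` filter above feeds into the index, here is a rough Python sketch that imitates it outside Elasticsearch. It is only an approximation: the regex stands in for the standard tokenizer, and the function name `shingle_counts` and the sample sentence are made up for illustration. The parameters mirror `min_shingle_size`, `max_shingle_size` and `output_unigrams` from the mapping.

```python
import re
from collections import Counter


def shingle_counts(text, min_size=2, max_size=3, output_unigrams=True):
    """Count word-level n-grams (shingles) in a single piece of text.

    Hypothetical helper: an approximation of lowercase + shingle filtering,
    not Elasticsearch's actual analysis chain.
    """
    # Crude stand-in for the standard tokenizer plus the lowercase filter.
    tokens = re.findall(r"\w+", text.lower())

    counts = Counter()
    if output_unigrams:
        counts.update(tokens)

    # Slide a window of each allowed size over the token list.
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            shingle = " ".join(tokens[start:start + size])
            counts[shingle] += 1
    return counts


counts = shingle_counts("the quick fox and the quick dog")
print(counts["the quick"])      # frequency of one bigram
print(counts["the quick dog"])  # frequency of one trigram
```

This is handy for sanity-checking which bigrams and trigrams you should expect to see for a given document before looking at what Elasticsearch actually stored.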
    

    Again, be careful using term vectors in production.
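
Once term vectors are stored, the per-document frequencies come back from the term vectors API (the endpoint is `_termvector` in 1.x releases and `_termvectors` in later ones, as far as I know). The sketch below assumes you already have the JSON response parsed into a dict; it pulls out `term_freq` for each term and filters by n-gram size. The helper names and the `sample_response` are hand-written for illustration, not real output from the index above.

```python
def term_frequencies(response, field="content"):
    """Map each term in a term vectors response to its term_freq.

    Assumes the usual response layout:
    term_vectors -> <field> -> terms -> <term> -> term_freq
    """
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: info["term_freq"] for term, info in terms.items()}


def ngrams_of_size(freqs, size):
    """Keep only terms made of exactly `size` space-separated words."""
    return {term: freq for term, freq in freqs.items()
            if len(term.split(" ")) == size}


# Hand-written, abbreviated sample (hypothetical, not real output).
sample_response = {
    "term_vectors": {
        "content": {
            "terms": {
                "quick": {"term_freq": 2},
                "the quick": {"term_freq": 2},
                "quick fox": {"term_freq": 1},
                "the quick fox": {"term_freq": 1},
            }
        }
    }
}

freqs = term_frequencies(sample_response)
bigrams = ngrams_of_size(freqs, 2)
trigrams = ngrams_of_size(freqs, 3)
print(bigrams)
print(trigrams)
```

Keep in mind that `term_freq` is the count within a single document. If you need counts across the whole index, you would have to aggregate over documents yourself or look at the term statistics the API can optionally return.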