elasticsearch indexing lucene

How do I get the list of the full indexed terms in an ElasticSearch index?


Very simple question. I have an ElasticSearch index with a text field. How do I get the list of all the words indexed for that field? Is there any simple method?

I'm working in Python with the elasticsearch library.


Solution

  • ⚠️ Warning

    Fetching all the indexed terms of an index is expensive in both time and resources, especially when the number of unique terms is large. Be careful about running this on a production cluster.

    Solution

    To do this, Elasticsearch first needs to load all of those terms into memory, which is disabled by default for text fields (see the fielddata mapping parameter for more info).
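
Since the question mentions the Python client, here is a minimal sketch of how fielddata might be switched on for an existing text field. The helper names `fielddata_mapping` and `enable_fielddata` are made up for illustration, `es` is assumed to be an `elasticsearch.Elasticsearch` client, and the `body=` argument is assumed to be accepted by your client version (newer versions may prefer `properties=` as a keyword argument).

```python
def fielddata_mapping(field_name):
    """Build a mapping body that turns on fielddata for a text field."""
    return {
        "properties": {
            field_name: {"type": "text", "fielddata": True}
        }
    }


def enable_fielddata(es, index, field_name):
    # `es` is assumed to be an elasticsearch.Elasticsearch client.
    # Assumption: this client version accepts the mapping through body=.
    return es.indices.put_mapping(index=index, body=fielddata_mapping(field_name))
```

It could be called as `enable_fielddata(es, "my_index", "field_name")`, where both names are placeholders. If the field already has other mapping parameters (a custom analyzer, for example), they may need to be repeated in the mapping body.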

    Assuming that fielddata is enabled on the field, you can get the list of unique terms, sorted by frequency, using the search query below:

    {
        "size": 0,
        "aggs": {
            "indexed_terms": {
                "terms": {
                    "field": "field_name",
                    "size": 10000 (1)
                }
            }
        }
    }
    
    1. The size parameter controls the maximum number of unique terms to return.
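
From Python, the same query could be sent roughly as follows. This is a sketch, not a definitive implementation: `indexed_terms_query` and `get_indexed_terms` are hypothetical helper names, `es` is assumed to be an `elasticsearch.Elasticsearch` client, and passing the query through `body=` is assumed to be supported by your client version.

```python
def indexed_terms_query(field_name, max_terms=10000):
    """Build the terms aggregation shown above; max_terms maps to "size"."""
    return {
        "size": 0,  # no search hits are needed, only the aggregation
        "aggs": {
            "indexed_terms": {
                "terms": {"field": field_name, "size": max_terms}
            }
        },
    }


def get_indexed_terms(es, index, field_name, max_terms=10000):
    # `es` is assumed to be an elasticsearch.Elasticsearch client.
    resp = es.search(index=index, body=indexed_terms_query(field_name, max_terms))
    buckets = resp["aggregations"]["indexed_terms"]["buckets"]
    # Each bucket holds a term ("key") and the number of documents containing it.
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]
```

The result is a list of `(term, doc_count)` pairs in the order Elasticsearch returns them. Note that `doc_count` is the number of documents containing the term, not the total number of occurrences.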

    If fielddata is not enabled, you will get an error like the one below:

    Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on your fields in order to load field data by uninverting the inverted index. Note that this can use significant memory.

    For a single document ...

    If you only need the list of indexed terms for a single document, you can simply use the _termvectors API, which does not require fielddata to be enabled.
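
With the Python client, the single-document case might look like the sketch below. `get_document_terms` is a hypothetical helper name, `es` is assumed to be an `elasticsearch.Elasticsearch` client, and the index, document id, and field name are placeholders.

```python
def get_document_terms(es, index, doc_id, field_name):
    """Return {term: term_freq} for one field of one document."""
    # `es` is assumed to be an elasticsearch.Elasticsearch client.
    resp = es.termvectors(index=index, id=doc_id, fields=[field_name])
    term_vectors = resp["term_vectors"] if "term_vectors" in resp else {}
    # The field is absent from the response if the document has no terms for it.
    terms = term_vectors.get(field_name, {}).get("terms", {})
    return {term: info["term_freq"] for term, info in terms.items()}
```

A call such as `get_document_terms(es, "my_index", "1", "field_name")` would return a dictionary mapping each term in that document's field to how often it occurs there.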