
How to find the distinct values in the array in all the indexes using elasticsearch-dsl?


I am using elasticsearch-dsl in Django. I have a DocType document defined with a Keyword field that holds a list of values.

Here is my code:

from elasticsearch_dsl import DocType, Text, Keyword

class ProductIndex(DocType):
    """
    Index for products
    """
    id = Keyword()
    slug = Keyword()
    name = Text()
    filter_list = Keyword()

filter_list is the array field here; it holds multiple values. I also have a list of distinct values, say sample_filter_list, and some of its elements may appear in the filter_list of some products. Given sample_filter_list, I want all the unique elements of filter_list across all products whose filter_list has a non-empty intersection with sample_filter_list.

For example, I have 5 products whose filter_list values are:
1) ['a', 'b', 'c']
2) ['d', 'e', 'f']
3) ['g', 'h', 'i']
4) ['j', 'k', 'l']
5) ['m', 'n', 'o']
and if my sample_filter_list is ['a', 'd', 'g', 'j', 'm'],
then Elasticsearch should return an array containing the distinct elements,
i.e. ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o']

Solution

  •             This answer is not specific to Django but general.
                Suppose you have an ES index some_index2 with the mapping:
    
                PUT some_index2
                {
                  "mappings": {
                    "some_type": {
                      "dynamic_templates": [
                        {
                          "strings": {
                            "mapping": {
                              "type": "string"
                            },
                            "match_mapping_type": "string"
                          }
                        }
                      ],
                      "properties": {
                        "field1": {
                          "type": "string"
                        },
                        "field2": {
                          "type": "string"
                        }
                      }
                    }
                  }
                }
    
            Suppose you have also indexed these documents:
            {
                "field1":"id1",
                "field2":["a","b","c","d"]
            }
            {
                "field1":"id2",
                "field2":["e","f","g"]
            }
            {
                "field1":"id3",
                "field2":["e","l","k"]
            }
    
        Now, as you stated, you want all the distinct values of field2 (filter_list in your case). You can get them with an Elasticsearch terms aggregation:
    
        GET some_index2/_search
        {
          "aggs": {
            "some_name": {
              "terms": {
                "field": "field2",
                "size": 10000
              }
            }
          },
          "size": 0
        }
    
        This returns a result like:
    
        {
          "took": 2,
          "timed_out": false,
          "_shards": {
            "total": 5,
            "successful": 5,
            "failed": 0
          },
          "hits": {
            "total": 3,
            "max_score": 0,
            "hits": []
          },
          "aggregations": {
            "some_name": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "e",
                  "doc_count": 2
                },
                {
                  "key": "a",
                  "doc_count": 1
                },
                {
                  "key": "b",
                  "doc_count": 1
                },
                {
                  "key": "c",
                  "doc_count": 1
                },
                {
                  "key": "d",
                  "doc_count": 1
                },
                {
                  "key": "f",
                  "doc_count": 1
                },
                {
                  "key": "g",
                  "doc_count": 1
                },
                {
                  "key": "k",
                  "doc_count": 1
                },
                {
                  "key": "l",
                  "doc_count": 1
                }
              ]
            }
          }
        }
    
        Here, buckets contains the list of all the distinct values.
        You can iterate over the buckets and read each value from its "key" field.
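
As a minimal sketch of that iteration in Python, assuming the search response has already been parsed into a dict shaped like the result above. The helper name distinct_values and the trimmed sample_response are illustrative only and not part of any library:

```python
def distinct_values(response, agg_name="some_name"):
    """Return the bucket keys of a terms aggregation from a response dict."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Trimmed, hand-written copy of the first buckets of the response shown above.
sample_response = {
    "aggregations": {
        "some_name": {
            "buckets": [
                {"key": "e", "doc_count": 2},
                {"key": "a", "doc_count": 1},
                {"key": "b", "doc_count": 1},
            ]
        }
    }
}

print(distinct_values(sample_response))  # ['e', 'a', 'b']
```

The keys come back in bucket order (by default, highest doc_count first), so sort the list if you need a stable alphabetical order.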
    
    Hope this is what you need.
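
Note that the aggregation above runs over every document in the index. To restrict it to products whose filter_list shares at least one value with sample_filter_list, as the question asks, one option is to combine a terms filter with the same terms aggregation. The sketch below only builds the request body as a plain dict. The function name build_distinct_query and the aggregation name distinct_values are hypothetical, and it assumes an Elasticsearch version that accepts a bool query with a filter clause and a field indexed as exact values (such as the Keyword field in the question):

```python
import json


def build_distinct_query(field, sample_values, agg_name="distinct_values", size=10000):
    """Build a _search body: filter docs by any of sample_values, then
    aggregate the distinct values of the same field over the matches."""
    return {
        # No hits are needed, only the aggregation result.
        "size": 0,
        "query": {
            "bool": {
                # A terms filter matches docs whose field contains at least
                # one of the given values (a non-empty intersection).
                "filter": [{"terms": {field: list(sample_values)}}]
            }
        },
        "aggs": {
            agg_name: {"terms": {"field": field, "size": size}}
        },
    }


body = build_distinct_query("filter_list", ["a", "d", "g", "j", "m"])
print(json.dumps(body, indent=2))
```

With the five products from the question, every product matches the filter, so the buckets would be expected to cover 'a' through 'o'. In elasticsearch-dsl the same request can probably be expressed by chaining .filter("terms", filter_list=...) on a Search object and registering the aggregation through s.aggs.bucket(...); treat that as a pointer to check against the library documentation for your version.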