Tags: python, elasticsearch, elasticsearch-dsl

aggregate a field in elasticsearch-dsl using python


Can someone tell me how to write Python statements that will aggregate (sum and count) values in my documents?


SCRIPT

from datetime import datetime
from elasticsearch_dsl import DocType, String, Date, Integer
from elasticsearch_dsl.connections import connections

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])

s = Search(using=client, index="attendance")
s = s.execute()

for tag in s.aggregations.per_tag.buckets:
    print (tag.key)

OUTPUT

File "/Library/Python/2.7/site-packages/elasticsearch_dsl/utils.py", line 106, in __getattr__
'%r object has no attribute %r' % (self.__class__.__name__, attr_name))
AttributeError: 'Response' object has no attribute 'aggregations'

What is causing this? Is the "aggregations" keyword wrong? Is there some other package I need to import? If a document in the "attendance" index has a field called emailAddress, how would I count which documents have a value for that field?


Solution

  • First of all, I notice now that what I wrote above actually has no aggregations defined; the documentation on how to use them was not very readable for me. Starting from what I wrote above, I'll expand. I'm changing the index name to make for a nicer example.

    from datetime import datetime
    from elasticsearch_dsl import DocType, String, Date, Integer
    from elasticsearch_dsl.connections import connections
    
    from elasticsearch import Elasticsearch
    from elasticsearch_dsl import Search, Q
    
    # Define a default Elasticsearch client
    client = connections.create_connection(hosts=['http://blahblahblah:9200'])
    
    s = Search(using=client, index="airbnb", doc_type="sleep_overs")
    # don't call execute() yet; once you do, s becomes a Response object
    # and you can no longer attach aggregations to it
    
    # invalid! You haven't defined an aggregation.
    #for tag in s.aggregations.per_tag.buckets:
    #    print (tag.key)
    
    # Let's make an aggregation
    # 'by_house' is a name you choose, 'terms' is a keyword for the type of aggregator
    # 'field' is also a keyword, and 'house_number' is a field in our ES index
    s.aggs.bucket('by_house', 'terms', field='house_number', size=0)
    

    Above we're creating 1 bucket per house number, so the name of each bucket will be the house number. ElasticSearch (ES) will always give a count of the documents that fall into each bucket. size=0 means to give us all results, since ES has a default setting to return only 10 results (or whatever your dev set it up to do).

    # This runs the query.
    s = s.execute()
    
    # let's see what's in our results
    
    print s.hits.total
    print s.aggregations.by_house.buckets
    
    for item in s.aggregations.by_house.buckets:
        print item.doc_count
    

    My mistake before was thinking an ElasticSearch query had aggregations by default. You define them yourself, then execute them. Then your response can be split by the aggregations you defined.
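
    Since the question also asked about summing, a metric aggregation attaches in much the same way as the terms bucket above. Here is a rough sketch, reusing the client from above; 'nights_stayed' is a made-up numeric field used only for illustration:

    # 'total_nights' is a name we choose, 'sum' is the type of aggregator,
    # and 'nights_stayed' is a hypothetical numeric field on the documents
    s = Search(using=client, index="airbnb", doc_type="sleep_overs")
    s.aggs.metric('total_nights', 'sum', field='nights_stayed')
    
    s = s.execute()
    
    # a metric aggregation comes back as a single value rather than buckets
    print(s.aggregations.total_nights.value)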

    The CURL for the terms aggregation above should look like this.
    NOTE: I use SENSE, an ElasticSearch plugin/extension/add-on for Google Chrome. In SENSE you can use // to comment things out.

    POST /airbnb/sleep_overs/_search
    {
    // the size 0 here actually means to not return any hits, just the aggregation part of the result
        "size": 0,
        "aggs": {
            "by_house": {
                "terms": {
    // the size 0 here means to return all results, not just the default 10 results
                    "field": "house_number",
                    "size": 0
                }
            }
        }
    }
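
    Going back to the other part of the original question, counting which documents in "attendance" actually have a value for emailAddress: I have not run this against that index, but an exists query plus count() should do it. A rough sketch, reusing the client defined above:

    # count the documents in 'attendance' that have any value for emailAddress
    s = Search(using=client, index="attendance")
    s = s.query('exists', field='emailAddress')
    print(s.count())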
    

    Work-around. Someone on the GitHub repo for elasticsearch-dsl told me to forget translating and just use this method. It's simpler, and you can write the tough stuff in CURL. That's why I call it a work-around.

    # Define a default Elasticsearch client
    client = connections.create_connection(hosts=['http://blahblahblah:9200'])
    s = Search(using=client, index="airbnb", doc_type="sleep_overs")
    
    # see how simple this is: we just paste the CURL body here
    body = {
        "size": 0,
        "aggs": {
            "by_house": {
                "terms": {
                    "field": "house_number",
                    "size": 0
                }
            }
        }
    }
    
    # from_dict builds a fresh Search; it will use the default connection registered above
    s = Search.from_dict(body)
    s = s.index("airbnb")
    s = s.doc_type("sleep_overs")
    # (optional) to_dict() lets us inspect the request body we are about to send
    body = s.to_dict()
    
    t = s.execute()
    
    for item in t.aggregations.by_house.buckets:
        # item.key will be the house number
        print item.key, item.doc_count
    

    Hope this helps. I now design everything in CURL, then use Python statements to peel away at the results to get what I want. This helps for aggregations with multiple levels (sub-aggregations); see the rough sketch below.
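
    For example, here is a rough sketch of a two-level aggregation written this way; 'price' is a hypothetical numeric field used only for illustration, and the connection set up earlier is reused:

    # one bucket per house, with a sum of the hypothetical 'price' field inside each bucket
    body = {
        "size": 0,
        "aggs": {
            "by_house": {
                "terms": {"field": "house_number", "size": 0},
                "aggs": {
                    "total_price": {"sum": {"field": "price"}}
                }
            }
        }
    }
    
    s = Search.from_dict(body)
    s = s.index("airbnb")
    s = s.doc_type("sleep_overs")
    t = s.execute()
    
    for house in t.aggregations.by_house.buckets:
        # each bucket carries its own sub-aggregation results
        print("%s %s %s" % (house.key, house.doc_count, house.total_price.value))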