I have an index in Elasticsearch where each document holds information about a user along with a Facebook post they have made (denormalized).
Each document contains: User_ID | User_Name | Post_Text | Post_Emojis
I want to retrieve the IDs of the users who have more than N posts.
I am new to Elasticsearch, and especially to the Search DSL for Python (https://elasticsearch-dsl.readthedocs.io/en/latest/search_dsl.html).
I am creating buckets using the terms aggregation on the User_ID field, and want to filter the buckets based on the number of documents that fall inside each bucket.
This is the function I have so far. I'm unsure of the proper syntax and still confused by the documentation, so I can't get it to run and return the correct response.
def users_more_posts_than_query(search_object: Search, num_posts: int):
    search_object = search_object.aggs.bucket('posts_count', 'terms', field='user_id')\
        .pipeline("having_posts", "bucket_selector", buckets_path={"postsCount": "_count"}, script=f"params.postsCount > {num_posts}")
    response = search_object.execute()
    for hit in response.hits:
        hit.user_id
Please point out what I am doing wrong here and how I can achieve my goal.
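Don't re-assign search_object: the aggs.bucket(...).pipeline(...) chain modifies the search in place and returns the aggregation object, not the Search, so after the assignment the name no longer refers to something you can execute.
Also, aggregation results are returned separately from the hits, under response.aggregations: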
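from elasticsearch_dsl import Search

def users_more_posts_than_query(search_object: Search, num_posts: int):
    # aggs.bucket() modifies search_object in place, so no re-assignment
    search_object.aggs.bucket('posts_count', 'terms', field='user_id').pipeline(
        "having_posts", "bucket_selector",
        buckets_path={"postsCount": "_count"},
        script=f"params.postsCount > {num_posts}")
    response = search_object.execute()
    # each remaining bucket is one user: bucket.key is the user_id,
    # bucket.doc_count is the number of that user's posts
    for bucket in response.aggregations.posts_count.buckets:
        print(bucket)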
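
To see what the function above sends and gets back, here is a rough sketch of the equivalent raw request body and of pulling the user IDs out of the raw JSON response. It uses plain dicts only, so it runs without a cluster. The names build_request_body, user_ids_from_response and max_users are made up for illustration, and the sample response is hand-written, not real output. The "size" settings are my additions, not part of the original function: a terms aggregation returns only the top 10 buckets by default, so you will probably want to raise that, and "size": 0 at the top level just skips the hits.

```python
def build_request_body(num_posts: int, max_users: int = 1000) -> dict:
    """Raw body roughly equivalent to the aggs.bucket(...).pipeline(...) chain."""
    return {
        "size": 0,  # assumption: hits are not needed, only the aggregation
        "aggs": {
            "posts_count": {
                # max_users is a hypothetical knob; terms defaults to 10 buckets
                "terms": {"field": "user_id", "size": max_users},
                "aggs": {
                    "having_posts": {
                        "bucket_selector": {
                            "buckets_path": {"postsCount": "_count"},
                            "script": f"params.postsCount > {num_posts}",
                        }
                    }
                },
            }
        },
    }


def user_ids_from_response(response: dict) -> list:
    """Collect the bucket keys (user IDs) from a raw search response."""
    buckets = response["aggregations"]["posts_count"]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Hand-written sample shaped like a search response, for illustration only.
sample_response = {
    "hits": {"hits": []},
    "aggregations": {
        "posts_count": {
            "buckets": [
                {"key": "user_a", "doc_count": 12},
                {"key": "user_b", "doc_count": 9},
            ]
        }
    },
}

body = build_request_body(num_posts=5)
ids = user_ids_from_response(sample_response)
print(body["aggs"]["posts_count"]["aggs"]["having_posts"])
print(ids)
```

With the Search object the same values are available as bucket.key and bucket.doc_count on each entry of response.aggregations.posts_count.buckets.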