python, django, elasticsearch, elasticsearch-dsl, elasticsearch-py

Best practice for removing stale documents in Elasticsearch


I have a Django app that pushes models into Elasticsearch. A post-save signal updates the index after each save, but I also want to write a batch command that updates all documents.

As part of this process, I want to remove documents that have become stale (e.g. were set inactive or deleted in the database).

I started with something like this:

  • update all documents and store the updated / created ids
  • create one gigantic exclude query
  • delete all documents that match it

Something like this:

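from elasticsearch_dsl import Q
from elasticsearch_dsl.query import Bool

f = None
for i in updated_ids:
    q = Q('match', id=i)
    # OR every id clause together into one query
    f = f | q if f is not None else q
# Select every document whose id is NOT among the updated ones
queryset = dt.search().query(Bool(filter=[~f]))
for stale in queryset.scan():
    stale.delete()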

But the query becomes too long, and the request fails.

Is there a more efficient way of doing this?

I use elasticsearch-dsl on top of elasticsearch-py. Django-Haystack is not an option.


Solution

  • I'm now doing it like this:

    for dt, updated_ids in self.updated.items():
        # Collect the id of every document currently in the index
        existing_ids_in_index = [d.id for d in dt.search().scan()]
        # Anything indexed but not touched by this run is stale
        stale_ids = set(existing_ids_in_index) - set(updated_ids)
        for stale_id in stale_ids:
            dt.find_one('id', stale_id).delete()
        print("... {}: Removed {}.".format(dt.get_model().__name__, len(stale_ids)))
    

    I could further optimize this with delete_by_query, but I'm unsure about the details.
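
    For what it's worth, here is a rough sketch of how the delete_by_query variant might look. It keeps the set difference from the loop above but deletes the stale ids in batches with a `terms` query instead of one request per document. The helper names (`chunked`, `stale_delete_bodies`, `delete_stale`), the batch size of 1000, and the assumption that `id` is indexed as an exact-value field (keyword or numeric) are mine, not part of the original code. `client` is assumed to be an elasticsearch-py client; newer client versions prefer passing `query=` over `body=`.

```python
from itertools import islice


def chunked(ids, size):
    """Yield lists of at most `size` ids from any iterable."""
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def stale_delete_bodies(existing_ids, updated_ids, field="id", batch_size=1000):
    """Build one delete-by-query body per batch of stale ids.

    Assumes `field` holds exact values, so a `terms` query matches them.
    """
    stale = sorted(set(existing_ids) - set(updated_ids))
    return [
        {"query": {"terms": {field: batch}}}
        for batch in chunked(stale, batch_size)
    ]


def delete_stale(client, index, existing_ids, updated_ids, **kwargs):
    """Send each batch through the client's delete_by_query call.

    Returns the number of requests that were sent.
    """
    bodies = stale_delete_bodies(existing_ids, updated_ids, **kwargs)
    for body in bodies:
        client.delete_by_query(index=index, body=body)
    return len(bodies)
```

    Batching keeps each request small, so the query never grows with the total number of documents the way the original exclude query did. Elasticsearch also caps the number of values in a single `terms` query (65,536 by default), so the batch size should stay below that.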