I have a Django app that pushes models into Elasticsearch. A post-save signal keeps individual documents up to date, but I also want to write a batch command that updates all documents.
As part of this process I want to remove documents that have become stale (e.g. set inactive or deleted in the database).
I started with something like this:
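    from elasticsearch_dsl import Q

    # OR together one match query per updated id
    f = None
    for i in updated_ids:
        q = Q('match', **{'id': i})
        f = q if f is None else f | q
    # everything that does NOT match any updated id is considered stale
    queryset = dt.search().query('bool', filter=[~f])
    for stale in queryset.scan():
        stale.delete()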
But the query becomes too long and the request fails.
I wonder if there is a more efficient way of doing this.
I use elasticsearch-dsl on top of elasticsearch-py. Django-Haystack is not an option.
I'm now doing it like this:
    for dt, updated_ids in self.updated.items():
        existing_ids_in_index = [d.id for d in dt.search().scan()]
        stale_ids = list(set(existing_ids_in_index) - set(updated_ids))
        for stale_id in stale_ids:
            dt.find_one('id', stale_id).delete()
        print("... {}: Removed {}.".format(dt.get_model().__name__, len(stale_ids)))
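One thing that might already help with the loop above is to avoid one round trip per stale document. The sketch below only builds the delete actions from the same set difference. It assumes that the Elasticsearch `_id` of each document equals the database id (not confirmed by my setup above), and `stale_delete_actions` and the index name `"docs"` are hypothetical names used for illustration.

```python
def stale_delete_actions(index_name, existing_ids, updated_ids):
    """Yield one bulk 'delete' action per id that is in the index
    but was not touched by the current update run.

    Assumption: the document _id in Elasticsearch equals the database id.
    """
    stale_ids = set(existing_ids) - set(updated_ids)
    for doc_id in sorted(stale_ids):
        yield {"_op_type": "delete", "_index": index_name, "_id": doc_id}


# Example with made-up ids: 1 and 3 exist in the index but were not updated.
actions = list(stale_delete_actions("docs", [1, 2, 3, 4], [2, 4]))
```

If that assumption holds, the resulting actions could probably be handed to `elasticsearch.helpers.bulk(client, actions)` so that all deletes go out in a few batched requests instead of one request each.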
I could probably optimize this further with a delete_by_query, but I'm unsure about the details.
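A rough sketch of what the request body for such a delete_by_query might look like, under some assumptions: instead of OR-ing one `match` clause per id (which is likely what hit the bool clause limit), it puts the ids into a few `terms` clauses under `must_not`. The helper name `build_stale_query`, the `chunk_size` default, and the use of the `id` field are my own choices for illustration. As far as I know Elasticsearch caps a single `terms` query at 65,536 values by default (`index.max_terms_count`), hence the chunking.

```python
def build_stale_query(updated_ids, field="id", chunk_size=10000):
    """Build a query body matching every document whose `field`
    is NOT among updated_ids.

    Each chunk becomes its own `terms` clause inside `must_not`, so a
    document is kept as soon as its id appears in any chunk.
    """
    ids = sorted(set(updated_ids))
    if not ids:
        # Nothing was updated, so every document counts as stale.
        # Careful: this deletes the whole index when used for deletion.
        return {"query": {"match_all": {}}}
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    must_not = [{"terms": {field: chunk}} for chunk in chunks]
    return {"query": {"bool": {"must_not": must_not}}}


# Example with made-up ids and a tiny chunk size to show the shape.
body = build_stale_query([3, 1, 2, 3], chunk_size=2)
```

With elasticsearch-py this body could presumably be sent through the client's `delete_by_query` call for the relevant index, and elasticsearch-dsl exposes the same operation as `Search.delete()`. I have not verified how this behaves with very large id lists.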