
Elasticsearch search['hits']['total']['value'] includes deleted documents in the total. How do I purge/refresh, or alternatively exclude, deleted documents?


I am following this tutorial to add Elasticsearch to my Python Flask application; the application is otherwise largely unrelated to the tutorial.

I am running elasticsearch-8.9.2 locally on my Windows PC. My application runs Flask with a MySQL database (locally).

When I ingested the database content (news snippets) into Elasticsearch and displayed my search results in the application, I realised that I had several duplicates in my Elasticsearch index (and that they were duplicates in my database). Each of the 1000 entries was present four times, resulting in 4000 entries. As a result, my search, which should have given me 6 results, gave me 24.

I deleted the content of my database and the index on elasticsearch:

with app.app_context():
    app.elasticsearch.indices.delete(index='news')

After cleaning up my database and verifying that it now contained 1000 news snippets, I used the given classmethod to add everything from the database to the index on Elasticsearch:

@classmethod
def reindex(cls):
    for obj in cls.query:
        add_to_index(cls.__tablename__, obj)

However, while the following search now limits the returned list of ids to what I am looking for, the total number of results, taken from search['hits']['total']['value'], has increased from 24 to 30. I want this number to be the total number of results that are not deleted. The following is the query from the tutorial:

def query_index(index, query, page, per_page):
    if not current_app.elasticsearch:
        return [], 0
    search = current_app.elasticsearch.search(
        index=index,
        body={'query': {'multi_match': {'query': query, 'fields': ['*']}},
              'from': (page - 1) * per_page, 'size': per_page})
    ids = [int(hit['_id']) for hit in search['hits']['hits']]
    return ids, search['hits']['total']['value']

I have found that deleted items in Elasticsearch are not purged, but still exist and are marked as "deleted". I therefore tried refreshing the index with app.elasticsearch.indices.refresh(index='news'), and I also restarted Elasticsearch to force a refresh.
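
One way to see what the index actually holds, independent of any search query, is to read the document counters directly. The following is only a sketch: doc_counts is a hypothetical helper name, and it assumes es is an elasticsearch-py 8.x client such as app.elasticsearch and that the index is called 'news'.

```python
def doc_counts(es, index):
    """Return live and deleted document totals for `index`.

    `es` is assumed to be an elasticsearch-py client (hypothetical helper,
    not part of the tutorial).
    """
    # Index stats report both live documents and not-yet-merged deletions.
    stats = es.indices.stats(index=index, metric='docs')
    docs = stats['indices'][index]['primaries']['docs']
    return {
        'live': docs['count'],       # live documents in primary shards
        'deleted': docs['deleted'],  # documents marked deleted, awaiting merge
        # The count API matches all documents when no query is given.
        'count_api': es.count(index=index)['count'],
    }
```

Calling doc_counts(app.elasticsearch, 'news') inside an application context right after reindexing should show whether the index contains the expected 1000 documents or more.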


Solution

  • The total in the search response counts only the actual, non-deleted results. The only place where deleted documents are counted is the index stats, and even there, deleting the index physically removes all records, including deleted ones, so they should not show up there either.

    I think your assumption is wrong and the issue is somewhere else. Look more carefully at the results you are getting back and figure out why your app is adding these records from the database. I would start by adding a counter to the reindex operation to check how many times it runs, how many records are added, and whether the process handles the record ids correctly.
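
The counter suggested above could look roughly like this. It is a sketch under assumptions, not the tutorial's code: reindex_with_counts is a hypothetical name, objects stands in for cls.query, each row is assumed to expose an id attribute, and add_to_index is assumed to be the tutorial helper with the signature add_to_index(index, obj).

```python
from collections import Counter


def reindex_with_counts(index, objects, add_to_index):
    """Send every object to the index and report what was sent.

    Hypothetical helper: `objects` is any iterable of rows with an `id`
    attribute, `add_to_index` is the function used in the question.
    """
    seen = Counter()
    for obj in objects:
        add_to_index(index, obj)
        seen[obj.id] += 1
    # Ids seen more than once point at repeated rows in the source data.
    repeated = {doc_id: n for doc_id, n in seen.items() if n > 1}
    return {
        'sent': sum(seen.values()),
        'unique_ids': len(seen),
        'repeated': repeated,
    }
```

If 'sent' is larger than the number of rows expected in the database, or the function is invoked more than once per cleanup, that would point to the application rather than to deleted documents in Elasticsearch.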