Tags: python, pandas, mongodb, mongodb-query, sharding

How to query a very large MongoDB database (3 shards of around 260-300 million documents each)


I need to find documents whose date column falls within different date ranges, in a sharded database holding around 800 million documents in total. I am using this query:

cursordata=event.aggregate([{"$match":{}},{"$unwind":},{"$project":{}}])

However, when I convert the result to a pandas DataFrame

df=pd.DataFrame(cursordata)

it takes forever and never finishes; it just gets stuck.

I have two choices:

  1. Keep querying MongoDB directly for each condition, or
  2. Convert the data to a DataFrame first, then apply the different conditions there

Please suggest how to proceed.


Solution

  • Could we have a sample of documents? I think you should look for an index matching the fields you're querying.

    As a reminder, keep the Equality, Sort, Range (ESR) rule in mind when designing MongoDB indexes.
    Besides, since you're on a sharded cluster, you should include the shard key in your query; otherwise mongos will have to query all the shards.
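
To make the two suggestions concrete, here is a minimal sketch, not a definitive implementation. The field names are assumptions, since no sample document was posted: `customer_id` stands in for whatever the shard key actually is, and `event_date` for the date field being filtered. The index puts the equality field before the range field, and the `$match` stage includes the shard key so that mongos should be able to route the query instead of broadcasting it.

```python
from datetime import datetime

# Hypothetical field names: "customer_id" stands in for the real shard key,
# "event_date" for the date field that is range-filtered.
SHARD_KEY = "customer_id"
DATE_FIELD = "event_date"

# Compound index spec: equality field first, range field last
# (ESR; there is no sort stage in this example).
INDEX_KEYS = [(SHARD_KEY, 1), (DATE_FIELD, 1)]


def build_pipeline(shard_key_value, start, end):
    """Build a pipeline whose $match has equality on the shard key
    plus a half-open date range [start, end)."""
    return [
        {"$match": {
            SHARD_KEY: shard_key_value,
            DATE_FIELD: {"$gte": start, "$lt": end},
        }},
        # Project only the fields that are needed, to shrink the result.
        {"$project": {"_id": 0, SHARD_KEY: 1, DATE_FIELD: 1}},
    ]


pipeline = build_pipeline("c-42", datetime(2021, 1, 1), datetime(2021, 2, 1))

# With a live pymongo collection named `event` (as in the question):
#   event.create_index(INDEX_KEYS)
#   cursordata = event.aggregate(pipeline)
```

Putting `$match` first and keeping it as selective as possible also means far fewer documents reach the later stages and the DataFrame conversion.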