I am using "MongoDB v4.2.x". My server memory is just 4GB, and MongoDB is utilizing more than 60%. I am running simple queries, not even aggregations, and the response time is too slow.
The question: How to reduce memory consumption and improve response time when querying a MongoDB database?
Ideas up to now:
Is there a memory limit option in MongoDB so that the parts of the loaded database that are not in use can be paged out to disk?
Changing "wiredTiger" cache size up to 1GB, but response time stays very slow. Are there any other MongoDB tweaks?
Is there a workaround in Python instead of tweaking MongoDB itself?
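For reference, here is a minimal sketch of what capping that cache looks like in mongod.conf; the file location and the rest of the configuration depend on your installation:

```yaml
# mongod.conf sketch: cap the WiredTiger internal cache at 1GB
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1
```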
If you just want to improve response time and reduce the memory consumed by MongoDB, one workaround is to load the MongoDB data into a pandas DataFrame. There are two options for doing this.
PyMongo's bson module: If it is really just a problem of connecting to MongoDB, you can export the database (or ideally only the exact part of it that you really need) as a bson file and then read the whole bson file into one pandas DataFrame using pymongo's bson.decode_all(). See Read BSON file in Python? for details.
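A minimal sketch of that approach, assuming the collection was exported with mongodump; the dump path is only an example:

```python
import bson                 # ships with pymongo
import pandas as pd

# Read the exported .bson dump into a list of dicts (path is an example)
with open("dump/mydb/mycollection.bson", "rb") as f:
    docs = bson.decode_all(f.read())

df = pd.DataFrame(docs)     # one DataFrame holding the whole export
```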
MongoDB collection: Or, if you can have MongoDB running at least at the start, you can load the data from a MongoDB collection into a pandas DataFrame, see How can I load data from MongoDB collection into pandas' DataFrame?. After loading, shut MongoDB down to free the memory it consumes.
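A sketch of that option; the connection string and database/collection names are placeholders:

```python
from pymongo import MongoClient
import pandas as pd

client = MongoClient("mongodb://localhost:27017/")   # placeholder connection string
collection = client["mydb"]["mycollection"]          # placeholder db/collection names

# list() materializes the whole cursor; drop _id here if you do not need it
df = pd.DataFrame(list(collection.find({}, {"_id": 0})))

client.close()   # afterwards, stop the mongod process to release its memory
```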
The extra time for loading the database at the start is a one-off cost. Once you have the whole database in one DataFrame, you can use Python to query it in memory.
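For example, a simple MongoDB filter becomes a boolean selection on the in-memory DataFrame (the column names below are made up):

```python
# MongoDB: db.mycollection.find({"status": "active", "amount": {"$gt": 100}})
# pandas equivalent on the in-memory DataFrame:
result = df[(df["status"] == "active") & (df["amount"] > 100)]
```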
Here are some ways to reduce memory consumption and response time in Python itself:
You can explicitly free memory while your Python script is running (see How can I explicitly free memory in Python?), or you can simply overwrite objects you no longer need.
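A small sketch of explicitly freeing an object you no longer need; the variable name is just an example:

```python
import gc
import pandas as pd

intermediate_df = pd.DataFrame({"x": range(1_000_000)})  # example throwaway object

del intermediate_df   # drop the last reference to the object
gc.collect()          # ask the garbage collector to release the memory now
```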
Avoid unneeded object copies: change objects with the "inplace" parameter, self-assign changed objects, use .to_numpy(copy=False), or use other tricks to change objects in place, that is, to avoid copies.
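A few examples of avoiding copies (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"amount": [1.0, 2.0], "unused_col": ["a", "b"]})

# Change the object in place instead of binding a modified copy to a new name
df.drop(columns=["unused_col"], inplace=True)

# Self-assignment: the old version can be garbage-collected right away
df = df.sort_values("amount")

# Ask for the underlying data without forcing a copy (not always possible)
arr = df.to_numpy(copy=False)
```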
For large-scale operations, convert pandas objects to numpy objects where possible. pandas is built on top of numpy and is fastest when you work on numpy directly; pandas offers more (and more convenient) options but adds overhead.
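A sketch of dropping down to numpy for the heavy numeric part, again with made-up column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [10.0, 20.0, 30.0]})

values = df["amount"].to_numpy()   # plain numpy array, no pandas overhead
total = np.sum(values)             # vectorized numpy operations on the array
scaled = values * 1.19
```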
Prefer a list comprehension over df.apply() / df.iterrows(), see Dataframe list comprehension “zip(…)”: loop through chosen df columns efficiently with just a list of column name strings.
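A sketch of the zip-based list comprehension, driven by a plain list of column name strings (names are made up):

```python
import pandas as pd

df = pd.DataFrame({"amount": [2, 3], "price": [10.0, 4.5]})
cols = ["amount", "price"]   # choose the columns by name

# Loop over the chosen columns together instead of df.apply() / df.iterrows()
df["total"] = [amount * price for amount, price in zip(*(df[c] for c in cols))]
```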
And when you have a whole database in one DataFrame, there is more you can consider; for example, you might want to change between wide and long format:
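With pandas you can switch between the two formats with melt() and pivot(); the small table below is made up:

```python
import pandas as pd

# Wide format: one row per id, one column per measurement
wide = pd.DataFrame({"id": [1, 2], "temp": [20.5, 21.0], "humidity": [30, 35]})

# Wide -> long: one row per (id, measurement) pair
long_df = wide.melt(id_vars="id", var_name="measurement", value_name="value")

# Long -> wide again
wide_again = long_df.pivot(index="id", columns="measurement", values="value").reset_index()
```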