I have a collection of ~500M documents. Every time when I execute a query, I receive one or more documents from this collection. Let's say I have a counter for each document, and I increase this counter by 1 whenever this document is returned from the query. After a few months of running the system in production, I discover that the counter of only 5% of the documents is greater than 0 (zero). Meaning, 95% of the documents are not used.
My question is: Is there an efficient way to arrange these documents to speedup the query execution time, based on the fact that 95% of the documents are not used?
What is the best practice in this case?
If - for example - I will add another boolean field for each document named "consumed" and index this field. Can I improve the query execution time somehow?
~500M documents
That is quite a solid figure, good job if that's true. So here is how I see the solution of the problem:
How does it looks like?
Imagine you have a two collections (or even two databases, if you are using micro service architecture)
Relevant docs / Irrelevant docs.
Basically you could use find only on relevant
docs collection (which store 5% of your useful docs) and if there is nothing, then use Irrelevant.find()
. This pattern will allows you to store old/historical data. And manage it via TTL index
or capped collection
.
You could also add some Redis magic to it. (Which uses precisely the same logic), take a look:
This article can also be helpful (as many others, like this SO question)
But don't try to replace Mongo with Redis, team them up instead.
Indexes
and .explain()
If - for example - I will add another boolean field for each document named "consumed" and index this field. Can I improve the query execution time somehow?
Yes, it will deal with your problem. To take a look, download MongoDB Compass, create this boolean
field in your schema, (don't forget to add default value), index the field and then use Explain
module with some query. But don't forget about compound
indexes! If you create field on one index, measure the performance by queering only this one field.
The result should been looks like this:
If your index have usage (and actually speed-up) Compass will shows you it.
To measure the performance of the queries (with and without indexing), use Explain
tab.
Actually, all this part can be done without Compass itself, via
.explain
and.index
queries. But Compass got better visuals of this process, so it's better to use it. Especially since he becomes absolutely free for all.