Tags: database, mongodb, indexing, compound-key

Performance of filter + sort when filtering with a range on a compound index whose key has many distinct values (like millisecond timestamps)


Imagine that I have a Reddit-type application and I want to allow users to see posts from the last day or week or month, sorted by score (descending).

So let's say I have this index:

db.collection("posts").createIndex({
  subreddit: 1,      // string
  creationTime: -1,  // integer (millisecond timestamp)
  score: -1,         // decimal
});

And the sort of query that I want to run is like this:

db.collection("posts").find({
  subreddit: "foo", 
  creationTime: { $gt: Date.now()-1000*60*60*24 }, // posts from last 24 hours
}).sort({ score: -1 });

Here's my question: since creationTime has many possible values, I'm wondering whether I'd get significantly better performance (assuming ~millions of posts per subreddit) if I instead created a property like creationHour, which would reduce the number of distinct values by a factor of roughly 3.6 million, meaning that there would be many posts per time increment. I don't need any more time resolution than hourly for my purposes.
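
To be concrete, here is roughly how I picture deriving that field at write time (just a sketch of the idea, not something I've implemented; the creationHour name is my own):

// Truncate the millisecond timestamp down to the start of its hour before storing it,
// so that many posts share each creationHour value.
const MS_PER_HOUR = 1000 * 60 * 60;
const now = Date.now();

db.collection("posts").insertOne({
  subreddit: "foo",
  creationTime: now,                                          // full millisecond resolution
  creationHour: Math.floor(now / MS_PER_HOUR) * MS_PER_HOUR,  // truncated to the hour
  score: 0,
});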

I don't know enough about how MongoDB's indexing works, but I have a vague intuition that the huge number of possible values of creationTime might make this sort of query significantly slower. On the other hand, I figure this is quite a common kind of operation, so I'd expect it to have been optimised to run efficiently either way. Hoping that a pro who understands MongoDB's indexing can chime in to help me understand this.


Solution

  • I believe the short answer to your direct question is that using a full-resolution integer timestamp versus bucketing by hour should make no noticeable performance difference for the current query and index.

    Why?

    Your question really has nothing to do with MongoDB specifically apart from the fact that they, like most database vendors, use a B+ tree data structure for their standard indexes. So the question is effectively about how B+ trees can be traversed.

    One characteristic of B+ trees is that they are an ordered data structure. So in the example index given, entries are ordered by subreddit first (ascending), then by creationTime (descending), and finally by score (descending). When the database recognizes that it can use this index to execute the query, it will attempt to scan as small a portion of the index as possible. We can see this in the explain output for your query with this index, where the reported index bounds are:

              indexBounds: {
                subreddit: [ '["foo", "foo"]' ],
                creationTime: [ '[inf.0, 1666897635152.0)' ],
                score: [ '[MaxKey, MinKey]' ]
              }
    
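    For reference, these bounds can be seen by asking the database to explain the query. A minimal sketch, assuming the Node.js driver (mongosh works similarly):

        // Inspect the winning plan and its index bounds for the original query.
        const plan = await db.collection("posts")
          .find({
            subreddit: "foo",
            creationTime: { $gt: Date.now() - 1000 * 60 * 60 * 24 },
          })
          .sort({ score: -1 })
          .explain("executionStats");

        console.log(JSON.stringify(plan.queryPlanner.winningPlan, null, 2));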

    What would the change from creationTime to creationHour do here with respect to the indexed data and the index bounds when executing this query? For the former, values such as 1666897635152 (representing 2022-10-27 19:07:15.152Z) would get truncated to something like 1666897200000 (representing 2022-10-27 19:00:00.000Z). This may save a little bit of space due to compression, but it doesn't reduce the number of entries in the database.

    For the latter, logically the query would still want to pull "the same" results. So the query would either issue the exact same predicate for the time (which would include or exclude a few results on the border hour, depending on how the hours are truncated), or the value in the query itself would be rounded first (similarly adding or removing entries on that border hour); a sketch of such a rounded query follows the bounds below. In either case, the index bounds for the proposed creationHour field would be effectively the same as they are for the current creationTime field. Perhaps something like this:

              indexBounds: {
                subreddit: [ '["foo", "foo"]' ],
                creationTime: [ '[inf.0, 1666897200000.0)' ],
                score: [ '[MaxKey, MinKey]' ]
              }
    
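    As a concrete sketch of what rounding the value in the query could look like (assuming a hypothetical creationHour field that stores the timestamp truncated down to the hour):

        // The same "last 24 hours" filter, with the cutoff rounded down to the start of its hour.
        const MS_PER_HOUR = 1000 * 60 * 60;
        const cutoff = Date.now() - 1000 * 60 * 60 * 24;

        db.collection("posts").find({
          subreddit: "foo",
          creationHour: { $gt: Math.floor(cutoff / MS_PER_HOUR) * MS_PER_HOUR },
        }).sort({ score: -1 });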

    You specifically mention that your proposed change "would reduce the number of distinct values by a factor of roughly 3.6 million, meaning that there would be many posts per time increment." It is true that you are reducing the number of distinct values captured in the database and index (hence the potential space savings). But overall, the number of matching documents that the database processes, and the manner in which it processes them, is very similar. Therefore I would expect performance to be similar as well.

    Index Key Ordering

    You may notice something of interest in the explain snippets above, notably the index bounds for score are:

    score: [ '[MaxKey, MinKey]' ]
    

    Why is that and what does this mean?

    The only place you use score in your query is in the sort. Therefore there are no bounds we can apply to that portion of the index scan. The (only) benefit of having the sort key in that position of the index is that it allows the database to sort the results after reading the index keys and before retrieving the full documents. We can see that in the structure of the explain:

        winningPlan: {
          stage: 'FETCH',
          inputStage: {
            stage: 'SORT',
            sortPattern: { score: -1 },
            ...
            inputStage: {
              stage: 'IXSCAN',
    

    With this query and index structure, the database must manually sort the results after identifying all of them. Alternatively, databases can use indexes to sort query results. Doing so, however, requires a specific ordering of the index keys with respect to the query predicates. An alternative index structure would be to promote score: -1 to the second position of the index. Doing so allows the database to traverse the index in the requested sort order, removing the need to manually sort the results:

        winningPlan: {
          stage: 'FETCH',
          inputStage: {
            stage: 'IXSCAN',
            keyPattern: { subreddit: 1, score: -1, creationTime: -1 },
    
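    For completeness, that alternative index would be created like so (same key pattern as shown in the snippet above):

        // Alternative index: the sort key (score) promoted ahead of the range key (creationTime).
        db.collection("posts").createIndex({
          subreddit: 1,
          score: -1,
          creationTime: -1,
        });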

    This approach generally requires scanning more index keys. This is because it is now the second index key that is unbounded during the scan:

            indexBounds: {
              subreddit: [ '["foo", "foo"]' ],
              score: [ '[MaxKey, MinKey]' ],
              creationTime: [ '[inf.0, 1666899575532.0)' ]
            }
    

    As your system grows, I would expect that the current index structure you have would outperform this alternative despite the fact that it has to manually perform the sort. Identifying the subset of matching results, particularly just for the last 24 hours, is likely going to be faster than scanning over most of the index in presorted order.

    More reading on that topic is here.

    Date Type

    As a minor observation, it might be worthwhile to store creationTime as a proper Date type. This may make some future querying more convenient. The aggregation framework, for example, has a variety of date expression operators, including $dateTrunc, which can be used to truncate a date to the hour.
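
    A small sketch of what that might look like in an aggregation, assuming creationTime were stored as a Date:

        // Group posts by the hour they were created in, using $dateTrunc (available in MongoDB 5.0+).
        db.collection("posts").aggregate([
          { $match: { subreddit: "foo" } },
          {
            $group: {
              _id: { $dateTrunc: { date: "$creationTime", unit: "hour" } },
              count: { $sum: 1 },
            },
          },
        ]);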

    Summary

    Apart from potentially changing creationTime from an integer to a proper Date type, your current approach seems reasonable. There probably is not a compelling performance reason to reduce the resolution of the creationTime field.