given a collection called probe
which grows by 11 million records per week and contains a record of vulnerability exploit probes that include a date
property and a source.ip
property, ie:
{
_id: ObjectId('64943e2001000100160b843e'),
date: '2023-06-22T12:27:12Z',
source: {
ip: '217.17.230.18',
port: 56882
},
target: {
ip: '88.198.64.58',
port: 22
}
}
what type of index would improve the performance of a query which selects distinct source.ip where date is greater than a variable date? ie (node.js):
export const recent = async (since) => (
await client.db('cichlid').collection('probe').distinct(
'source.ip',
{
date: {
$gte: since.toISOString().replace('.000Z', 'Z')
}
}
)
);
the query is run from an aws lambda behind api gateway which limits it to a 30 second timeout. there are currently no indexes (other than the automatically created ones) on the collection and it has started timing out as the collection size has grown.
date
, source.ip
)?date
?source.ip
?compass gives me a choice of index types (asc, desc, 2dsphere, text). since the date filter is generally now() minus 1 hour, i assume desc is appropriate for that field and text might be appropriate for the ip address. but i'm only guessing...
i went ahead and created 3 indexes:
compass shows that only the simple date (desc) index gets used. the distinct source.ip query went from taking 44 seconds to complete, to 500 milliseconds. i guess this answers the question.
in this particular case, the index needs to be on the filter constraint field and this increases performance by a factor of 88x.