I'm using MongoDB 7 to store data in a time series collection. When I find documents between a start and end some documents I'm sure are there (I can find them by _id) don't get retrieved, so I ran .validate()
on the collection which returned a warning:
Detected one or more documents in this collection incompatible with time-series specifications. For more info, see logs with log id 6698300.
Looking at the logs, I see messages like:
{
"t": { "$date": "2024-10-28T20:27:43.194+00:00" },
"s": "W",
"c": "STORAGE",
"id": 6698300,
"ctx": "conn193873",
"msg": "Document is not compliant with time-series specifications",
"attr": {
"namespace": "mydb.system.buckets.time_series_minute",
"recordId": "6467198e00ee59c36ca9d9d489",
"reason": {
"code": 53,
"codeName": "InvalidIdField",
"errmsg": "Mismatch between the embedded timestamp 2024-10-24T00:00:00.000+00:00 in the time-series bucket '_id' field and the timestamp 2024-10-24T04:00:00.000+00:00 in 'control.min' field."
}
}
}
I can fix the issue by dumping to JSON then re-importing:
mongoexport --db="timeSeriesTestDB" --collection="time_series_minute" --out="time_series.json" --jsonArray
mongoimport --db="new_db" --collection="time_series_minute" --file="time_series.json" --jsonArray
Then .validate()
returns no warnings and my queries return what they should, seemingly confirming that it's not an issue with the data or query but with how MongoDB internally buckets data -- apparently a mismatch between the buckets control.min and the documents timestamp? How does that happen?
The issue can be reproduced by simply spinning up a new MongoDB instance then inserting data:
docker run --name test -p 27017:27017 -d mongo:7
docker exec -it test mongosh
Then in mongosh paste this script:
// Switch to or create a new database for testing
use timeSeriesTestDB;
// Drop the collection if it exists (for repeatability)
db.time_series_minute.drop();
// Create the time-series collection
db.createCollection("time_series_minute", {
timeseries: {
timeField: "timestamp",
metaField: "metadata",
granularity: "minutes"
}
});
// Total number of data points to insert
const totalDataPoints = 100_000_000;
// Number of data points per batch
const batchSize = 10_000; // Adjust batch size based on available memory
// Total number of batches
const totalBatches = Math.ceil(totalDataPoints / batchSize);
// Start time for data generation
const startTime = new Date();
// Interval between data points in seconds
const intervalSeconds = 60; // One data point per minute
// Function to generate data batches with unique metadata every 1000 documents
function generateDataBatch(batchNumber, batchSize, startTime, intervalSeconds) {
const dataBatch = [];
const batchStartIndex = batchNumber * batchSize;
for (let i = 0; i < batchSize; i++) {
const index = batchStartIndex + i;
// Break if we've reached the total data points
if (index >= totalDataPoints) break;
// Generate a unique metadata value every 1000 documents
const metadataValue = `group_${Math.floor(index / 1000)}`;
dataBatch.push({
timestamp: new Date(startTime.getTime() + index * intervalSeconds * 1000),
value: Math.random() * 100,
metadata: metadataValue
});
}
return dataBatch;
}
// Insert data in batches
for (let batchNumber = 0; batchNumber < totalBatches; batchNumber++) {
// Generate the data batch
const dataBatch = generateDataBatch(batchNumber, batchSize, startTime, intervalSeconds);
// Insert the data batch into the collection
db.time_series_minute.insertMany(dataBatch);
// Log progress every 10 batches
if (batchNumber % 10 === 0) {
print(`Inserted batch ${batchNumber + 1} of ${totalBatches}`);
}
}
// Validate the collection
const validationResults = db.time_series_minute.validate({ full: true });
// Output the validation results
printjson(validationResults);
After running the script the validation should fail and some documents won't be found when querying by timestamp (even though they can be found when querying by _id or even a different range). For such a glaring bug there doesn't seem to be anything online about this but the time series collection type doesn't seem stand up to even the most basic tests e.g. inserting then later retrieving data. Is there something I'm missing about time series collections, or is the feature fundamentally broken in MongoDB?
According to MongoDB support the non-compliant documents are a red herring (there is a bug in the validation itself), but the inconsistent queries are a known bug that can be caused by deletes. From Mongo support:
the incorrect query results are the effect of deleting a time series measurement which represents the control.min.t time of an underlying time series bucket. When this occurs, the affected bucket is repacked with new lower bounds(control.min.t), and this is creating an inconsistency in the query results
This bug is tracked by SERVER-94559 which should be fixed in 7.0.16 release.