I have a MongoDB collection with >100k documents (this number will keep growing). Each document has a few fields that are a single value, and about 50 fields that are each an array of length 1000. I am analyzing results in R using rmongodb.
In rmongodb I am using mongo.find.all() with query set to some combination of criteria to search for, and fields set to a subset of the fields to return. The equivalent in the mongo shell would be something like:
db.collection.find({query1 : "value1", query2 : "value2"},{field1 : 1, field2 : 1, field3 : 1})
This returns a data.frame of the results, which I do some post-processing on and end up with a data.table.
What I would like to do is add some safeguards to the query. If the query is broad, and the fields returned are many of the larger array fields, the resulting data.table can be in the tens of GB. This might be what is expected, but I would like to add some flags or error checking so that someone doesn't accidentally try to return hundreds of GB at once.
I know I can get a count of the number of documents that match a query (mongo.count in rmongodb, db.collection.find({...},{...}).count() in the shell). I can also get an average document size (db.collection.stats().avgObjSize).
What I do not know how to do, nor do I know if it is possible, is to get the size (in MB, not number) of a find before the find is actually returned. Since I am often returning only a subset of the fields, the count and avgObjSize don't give me a very accurate estimate of how big the resulting data.table will be. The size would need to take into account both the query and the fields.
Is there a command like db.collection.find({},{}).sizeOf() that would return the size in MB of my find(query,fields)? The only options I can see are count() and size(), both of which return the number of documents.
You can iterate through the cursor manually (as is done in mongo.cursor.to.list) and check the size of the accumulated result after each document, stopping once it passes a limit. Something like this:
LIMIT <- 1024 * 1024 * 1024  # default cap: 1 GiB, as measured by object.size()

mongo.cursor.to.list_with_check <- function(cursor,
                                            keep.ordering = TRUE,
                                            limit = LIMIT) {
  # store documents in an environment to avoid extra copies
  e <- new.env(parent = emptyenv())
  res_size <- 0  # running total in bytes, local to this call
  i <- 1
  # test the size first so the cursor is not advanced once the limit is hit
  while (res_size < limit && mongo.cursor.next(cursor)) {
    val <- mongo.bson.to.list(mongo.cursor.value(cursor))
    res_size <- res_size + as.numeric(object.size(val))
    assign(x = as.character(i), value = val, envir = e)
    i <- i + 1
  }
  if (res_size >= limit) warning("size limit reached; the result may be truncated")
  # convert back to list
  res <- as.list(e)
  if (isTRUE(keep.ordering)) setNames(res[order(as.integer(names(res)))], NULL)
  else setNames(res, NULL)
}
After that you can convert it into a data.table via data.table::rbindlist().
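If you would rather refuse a query up front than truncate it midway, one option is to extrapolate from a small sample. The sketch below is only an illustration and uses base R alone: estimate_result_size is a hypothetical helper (not part of rmongodb), and sample_docs here is stand-in data. In practice the sample would presumably come from the same query and fields with a small limit (e.g. the limit argument of mongo.find.all with data.frame = FALSE), and n_matching from mongo.count. The estimate reflects R's in-memory size as reported by object.size(), not the BSON size, and it assumes the sampled documents are representative.

```r
# Hypothetical helper: extrapolate the in-memory size of a full result
# from a small sample of documents already converted to R lists.
estimate_result_size <- function(sample_docs, n_matching) {
  stopifnot(length(sample_docs) > 0, n_matching >= 0)
  # object.size() gives the R in-memory size in bytes, not the BSON size
  bytes_per_doc <- mean(vapply(sample_docs,
                               function(d) as.numeric(object.size(d)),
                               numeric(1)))
  bytes_per_doc * n_matching
}

# Stand-in for a sample fetched with the same query and fields (assumption)
sample_docs <- replicate(20,
                         list(field1 = runif(1000), field2 = runif(1000)),
                         simplify = FALSE)

# Stand-in for the count of matching documents (assumption)
est_bytes <- estimate_result_size(sample_docs, n_matching = 100000)

# Flag the query before running it in full
if (est_bytes > 1024^3) message("estimated result exceeds 1 GiB")
```

The data.table you build afterwards may be somewhat smaller or larger than this figure, so treat it as an order-of-magnitude check rather than an exact size.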