
pymongo method of getting statistics for collection byte usage?


The MongoDB Application FAQ mentions that short field names are a technique that can be used for small documents. This led me to thinking, "what's a small document anyway?"

I'm using pymongo. Is there any way I can write some Python to scan a collection and get a feel for the ratio of bytes used for field names vs. bytes used for actual field data?

I'm also tangentially curious about what the basic byte overhead is per document.


Solution

  • There is no built-in way to get the ratio of space used for keys in BSON documents versus space used for the actual field values. However, the collStats and dbStats commands report useful information on collection and database size. Here's how to use them in pymongo:

    from pymongo import MongoClient
    
    # connect to a mongod on localhost:27017 (the default)
    client = MongoClient()
    db = client.test
    
    # print collection statistics for the "events" collection
    print(db.command("collStats", "events"))
    
    # print database statistics
    print(db.command("dbStats"))
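
The reply from collStats is a plain dictionary, so the size-related fields can be pulled out directly. Below is a minimal sketch; the helper name `summarize_collstats` and the sample numbers are made up for illustration, and it assumes the reply carries the usual "count", "size", "avgObjSize" and "storageSize" fields (with "avgObjSize" possibly missing, for example on an empty collection).

```python
def summarize_collstats(stats):
    """Pick the size-related fields out of a collStats reply (a dict)."""
    count = stats.get("count", 0)
    size = stats.get("size", 0)
    # Fall back to size / count if the server did not report avgObjSize.
    fallback_avg = size / count if count else 0
    return {
        "count": count,                                   # number of documents
        "size": size,                                     # uncompressed data size
        "avgObjSize": stats.get("avgObjSize", fallback_avg),
        "storageSize": stats.get("storageSize", 0),       # space allocated on disk
    }


# Hypothetical reply shaped like collStats output; the numbers are invented.
sample_stats = {
    "ns": "test.events",
    "count": 1000,
    "size": 250000,
    "avgObjSize": 250,
    "storageSize": 400000,
}

summary = summarize_collstats(sample_stats)
print(summary)
```

Against a live server this would be called as `summarize_collstats(db.command("collStats", "events"))`. The "avgObjSize" value gives a rough idea of how large a typical document in the collection is.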
    

    You could always hack something up to get a pretty good estimate though. If all of your documents in a collection have the same schema, then something like this isn't half bad:

    1. Count the total number of bytes in the field names of a document (for ASCII names this equals the number of characters), and call this number a.
    2. Add one to a for each field to account for the null byte that terminates each field name in BSON. Let the result be b.
    3. Multiply b by the number of documents in the collection (the "count" field returned by collStats), and let the result be c.
    4. Divide c by the "size" field returned by collStats (assuming no scale factor was passed, so that size is reported in bytes). Let this value be d.

    d is then the approximate proportion of the collection's total data size that is used to store field names.
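
The four steps above can be sketched in Python roughly as follows. This is only an estimate under the same assumption that every document shares one schema; the function names `field_name_bytes` and `field_name_ratio` and the sample document are hypothetical, and the sketch goes slightly beyond the steps by also counting field names inside embedded documents.

```python
def field_name_bytes(doc):
    """Steps 1 and 2: bytes used by field names in one document.

    Each name costs its UTF-8 length plus one terminating null byte.
    Embedded documents are counted too; arrays are not descended into.
    """
    total = 0
    for key, value in doc.items():
        total += len(key.encode("utf-8")) + 1
        if isinstance(value, dict):
            total += field_name_bytes(value)
    return total


def field_name_ratio(sample_doc, doc_count, data_size_bytes):
    """Steps 3 and 4: share of the collection's data size spent on field names."""
    if data_size_bytes <= 0:
        return 0.0
    return field_name_bytes(sample_doc) * doc_count / data_size_bytes


# Hypothetical document and collection figures, invented for illustration.
sample_doc = {"_id": 1, "user": "alice", "ts": 0}
per_doc = field_name_bytes(sample_doc)            # (3+1) + (4+1) + (2+1)
ratio = field_name_ratio(sample_doc, 1000, 48000)
print(per_doc, ratio)
```

With pymongo, the inputs could come from `stats = db.command("collStats", "events")` and `doc = db.events.find_one()`, followed by `field_name_ratio(doc, stats["count"], stats["size"])`.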