
Is ZODB Bloat necessarily a bad thing?


I'm writing software that retrieves web pages, extracts some key information about each one into an object, and then writes that object to a ZODB database. I end up with roughly 350,000 of these objects written to my database.

After my code had run for some time, it started printing this warning whenever I added a new object to the database:

UserWarning: The <class 'persistent.mapping.PersistentMapping'>
object you're saving is large. (26362014 bytes.)
Perhaps you're storing media which should be stored in blobs.

Perhaps you're using a non-scalable data structure, such as a
PersistentMapping or PersistentList.

Perhaps you're storing data in objects that aren't persistent at
all. In cases like that, the data is stored in the record of the
containing persistent object.

In any case, storing records this big is probably a bad idea.

So my question is, first of all: does the 26MB in the warning refer to the single object being added, or to the entire database? Each of these objects should be quite small, but the message shows up for every new one I add.


Solution

  • 26MB is the size of the "pickle" produced for the entire PersistentMapping object, not for the single item you just added. As the message says, PersistentMapping isn't scalable: the whole mapping is stored as one database record, so every time you change it and commit, the entire object (including every item you added before) is written out again, plus the new pair. Over a series of additions and commits, the total database size grows quadratically in the number of items you've added, and so does the total time: each commit takes longer than the last, because it rewrites all previously added items too, not just the new one.

    See the docs for the various flavors of BTree that ZODB supports. Those are scalable, persistent key-value mappings that store their contents across many small records instead of one large one, and they are almost certainly what you should be using for this task.

    Note that ZODB implements several flavors of BTree for efficiency. Most general is the OOBTree, which allows general objects for both keys and values. Most specific is the IIBTree, which allows only 32-bit integers for keys and for values. Here's a tutorial:

    http://pythonhosted.org/BTrees
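To make the quadratic growth concrete, here is a rough back-of-the-envelope model in plain Python. It is only a sketch under stated assumptions: the function names, the fixed `item_size`, and the `bucket_capacity` parameter are all made up for illustration, one item is committed per transaction, and the "bucketed" case is a deliberately simplified stand-in for a BTree (it ignores interior nodes and bucket splits). It does not use or measure ZODB itself.

```python
def bytes_written_whole_mapping(n_items, item_size):
    """Total bytes written when every commit rewrites the whole mapping.

    Models a single-record mapping: commit k stores all k items added
    so far, so the total is item_size * (1 + 2 + ... + n_items).
    """
    return item_size * n_items * (n_items + 1) // 2


def bytes_written_bucketed(n_items, item_size, bucket_capacity):
    """Total bytes written when each commit rewrites only one small bucket.

    Simplified, hypothetical model of a bucketed structure: items fill
    fixed-capacity buckets in order, and a commit rewrites only the
    bucket that received the new item.
    """
    total = 0
    for k in range(1, n_items + 1):
        # Number of items in the bucket that holds item k after it is added.
        items_in_bucket = (k - 1) % bucket_capacity + 1
        total += item_size * items_in_bucket
    return total


if __name__ == "__main__":
    # Hypothetical numbers: 350,000 items of about 75 bytes each.
    n, size = 350_000, 75
    whole = bytes_written_whole_mapping(n, size)
    bucketed = bytes_written_bucketed(n, size, bucket_capacity=30)
    print(f"whole mapping rewritten each commit: {whole:,} bytes")
    print(f"one bucket rewritten each commit:    {bucketed:,} bytes")
```

In the first model the cost of each commit grows with everything stored so far, while in the second it is bounded by the bucket size, so the total grows only linearly. In practice, moving to that behavior typically means replacing the PersistentMapping with something like `BTrees.OOBTree.OOBTree`, which is used through the same mapping-style interface.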