How are blobs removed in RelStorage pack?


This question is related to "How to pack blobstorage with Plone and RelStorage".

Using a ZODB database with RelStorage and SQLite as its backend, I am trying to remove unused blobs. Currently, db.pack does not remove the blobs from disk. The minimal working example below demonstrates this behavior:

import logging
import numpy as np
import os
import persistent
from persistent.list import PersistentList
import shutil
import time
from ZODB import config, blob

connectionString = """
%import relstorage
<zodb main>
<relstorage>
blob-dir ./blob
keep-history false
cache-local-mb 0
<sqlite3>
    data-dir .
</sqlite3>
</relstorage>
</zodb>
"""


class Data(persistent.Persistent):
    def __init__(self, data):
        super().__init__()

        self.children = PersistentList()

        self.data = blob.Blob()
        with self.data.open("w") as f:
            np.save(f, data)


def main():
    logging.basicConfig(level=logging.INFO)
    # Initial cleanup
    for f in os.listdir("."):
        if f.endswith("sqlite3"):
            os.remove(f)

    if os.path.exists("blob"):
        shutil.rmtree("blob", True)

    # Initializing database
    db = config.databaseFromString(connectionString)
    with db.transaction() as conn:
        root = Data(np.arange(10))
        conn.root.Root = root

        child = Data(np.arange(10))
        root.children.append(child)

    # Removing child reference from root
    with db.transaction() as conn:
        conn.root.Root.children.pop()

    db.close()

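    # List the blob files on disk before packing
    print("blob directory:", [[os.path.join(rootDir, f) for f in files] for rootDir, _, files in os.walk("blob") if files])
    # Reopen the database and pack everything up to one second in the future
    db = config.databaseFromString(connectionString)
    db.pack(time.time() + 1)
    db.close()
    # List the blob files again; the child's blob is expected to be gone
    print("blob directory:", [[os.path.join(rootDir, f) for f in files] for rootDir, _, files in os.walk("blob") if files])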


if __name__ == "__main__":
    main()

The example above does the following:

  1. Removes any previous database in the current directory, along with the blob directory.
  2. Creates a database/storage from scratch and, in one transaction, adds two objects (root and child), with child referenced by root.
  3. Removes the reference from root to child in a second transaction.
  4. Closes the database/storage.
  5. Reopens the database/storage and runs db.pack with a pack time one second in the future.

The output of the minimal working example is the following:

INFO:ZODB.blob:(23376) Blob directory '<some path>/blob/' does not exist. Created new directory.
INFO:ZODB.blob:(23376) Blob temporary directory './blob/tmp' does not exist. Created new directory.
blob directory: [['blob/.layout'], ['blob/3/.lock', 'blob/3/0.03da352c4c5d8877.blob'], ['blob/6/.lock', 'blob/6/0.03da352c4c5d8877.blob']]
INFO:relstorage.storage.pack:pack: beginning pre-pack
INFO:relstorage.storage.pack:Analyzing transactions committed Thu Aug 27 11:48:17 2020 or before (TID 277592791412927078)
INFO:relstorage.adapters.packundo:pre_pack: filling the pack_object table
INFO:relstorage.adapters.packundo:pre_pack: Filled the pack_object table
INFO:relstorage.adapters.packundo:pre_pack: analyzing references from 7 object(s) (memory delta: 256.00 KB)
INFO:relstorage.adapters.packundo:pre_pack: objects analyzed: 7/7
INFO:relstorage.adapters.packundo:pre_pack: downloading pack_object and object_ref.
INFO:relstorage.adapters.packundo:pre_pack: traversing the object graph to find reachable objects.
INFO:relstorage.adapters.packundo:pre_pack: marking objects reachable: 4
INFO:relstorage.adapters.packundo:pre_pack: finished successfully
INFO:relstorage.storage.pack:pack: pre-pack complete
INFO:relstorage.adapters.packundo:pack: will remove 3 object(s)
INFO:relstorage.adapters.packundo:pack: cleaning up
INFO:relstorage.adapters.packundo:pack: finished successfully
blob directory: [['blob/.layout'], ['blob/3/.lock', 'blob/3/0.03da352c4c5d8877.blob'], ['blob/6/.lock', 'blob/6/0.03da352c4c5d8877.blob']]

As you can see, db.pack does remove 3 objects ("will remove 3 object(s)"), but the blobs in the file system are unchanged.
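
To make this check explicit rather than comparing printed lists by eye, the set of blob files can be diffed before and after an operation. This is only a minimal sketch using the standard library; the helper names blob_files and removed_by are hypothetical and not part of ZODB or RelStorage.

```python
import os


def blob_files(blob_dir):
    """Return the set of .blob file paths found under blob_dir."""
    found = set()
    for root_dir, _, files in os.walk(blob_dir):
        for name in files:
            # Skip bookkeeping files such as .layout and .lock
            if name.endswith(".blob"):
                found.add(os.path.join(root_dir, name))
    return found


def removed_by(action, blob_dir):
    """Run action() and return the blob files that disappeared."""
    before = blob_files(blob_dir)
    action()
    return before - blob_files(blob_dir)
```

In the script above this could be called as removed_by(lambda: db.pack(time.time() + 1), "blob"); an empty result means packing left every blob file in place.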

The RelStorage unit tests appear to check that blobs are removed from the file system, but in the script above this does not happen.

What am I doing wrong? Any hint/link/help is appreciated.


Solution

  • By default, the blob directory is used as a cache: it stores blob data that is also stored in the database, on the idea that loading blob data from a local disk cache is faster than loading it from a remote database server. Packing a history-free storage that uses a caching blob directory doesn’t delete unreachable blob files; instead it relies on the cache size limiter to evict stale cache data when room needs to be made. However, you did not set a size limit, so the cache grows without bound and those unreachable blob files will live on forever.

    Packing can’t remove blob files here because the cache is local to each ZODB client; it is outside the jurisdiction of the ZODB storage, as it were. This may not be as apparent when using SQLite as the database layer but imagine using Postgres instead, on a separate server, with multiple clients across different computers and you can see that cache clean-up is not feasible when packing.
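
To illustrate what "evicting stale cache data when room needs to be made" means, here is a purely conceptual sketch. It is not RelStorage's actual cache-cleanup code; the function name evict_least_recently_used and the choice of access time as the staleness measure are assumptions made for illustration.

```python
import os


def evict_least_recently_used(blob_dir, max_bytes):
    """Delete least-recently-accessed .blob files until the total fits.

    Conceptual illustration only; returns the list of removed paths.
    """
    entries = []
    for root_dir, _, files in os.walk(blob_dir):
        for name in files:
            if name.endswith(".blob"):
                path = os.path.join(root_dir, name)
                stat = os.stat(path)
                entries.append((stat.st_atime, stat.st_size, path))

    total = sum(size for _, size, _ in entries)
    removed = []
    # Oldest access time first, so stale files go before recently used ones
    for _, size, path in sorted(entries):
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed
```

The point of the sketch is that eviction is driven by the size limit and by what this one client has used recently, not by whether an object is still reachable in the database, which is why packing plays no part in it.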

    Note that the other option is shared blob storage, which is probably closer to what you expected: all blob data is stored on disk only, not in the database. When used with a remote database server and multiple clients, you’d need to place the blob directory on a network file system such as an NFS share. Packing operates directly on the blobs in that case, and unreachable blob files are removed immediately when you pack.

    You have two options:

    • Set a size limit for the blob cache by setting blob-cache-size. Packing still won’t remove the blob files, but they will be removed when space is running low.

    • Switch to a shared blob directory (set shared-blob-dir to true). For a SQLite-backed RelStorage this probably makes more sense than a caching blob storage, in spite of the dire warnings in the documentation!
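
For the first option, the configuration might look like the sketch below. The 1gb value is an arbitrary assumption for illustration; as far as I know the option accepts size suffixes such as mb and gb, and the limit should be well above the size of your largest blob.

```python
# Caching blob storage with a size cap. Packing still leaves blob files
# in place; they are only evicted once the cache exceeds the limit.
connectionString = """
%import relstorage
<zodb main>
<relstorage>
blob-dir ./blob
blob-cache-size 1gb
keep-history false
cache-local-mb 0
<sqlite3>
    data-dir .
</sqlite3>
</relstorage>
</zodb>
"""
```

With a test database holding only a few small blobs, a limit like this is never reached, so the two blob files would still be listed after packing.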

    So the easiest change would be to switch blob storage modes:

    connectionString = """
    %import relstorage
    <zodb main>
    <relstorage>
    blob-dir ./blob
    shared-blob-dir true
    keep-history false
    cache-local-mb 0
    <sqlite3>
        data-dir .
    </sqlite3>
    </relstorage>
    </zodb>
    """
    

    The output then changes to:

    INFO:ZODB.blob:(26177) Blob directory '<some path>/blob/' does not exist. Created new directory.
    INFO:ZODB.blob:(26177) Blob temporary directory './blob/tmp' does not exist. Created new directory.
    blob directory: [['blob/.layout'], ['blob/0x00/0x00/0x00/0x00/0x00/0x00/0x00/0x03/0x03da4f169582cd22.blob', 'blob/0x00/0x00/0x00/0x00/0x00/0x00/0x00/0x03/.lock'], ['blob/0x00/0x00/0x00/0x00/0x00/0x00/0x00/0x06/0x03da4f169582cd22.blob', 'blob/0x00/0x00/0x00/0x00/0x00/0x00/0x00/0x06/.lock']]
    INFO:relstorage.storage.pack:pack: beginning pre-pack
    INFO:relstorage.storage.pack:Analyzing transactions committed Tue Sep  1 01:22:35 2020 or before (TID 277621285453417864)
    INFO:relstorage.adapters.packundo:pre_pack: filling the pack_object table
    INFO:relstorage.adapters.packundo:pre_pack: Filled the pack_object table
    INFO:relstorage.adapters.packundo:pre_pack: analyzing references from 7 object(s) (memory delta: 0 KB)
    INFO:relstorage.adapters.packundo:pre_pack: objects analyzed: 7/7
    INFO:relstorage.adapters.packundo:pre_pack: downloading pack_object and object_ref.
    INFO:relstorage.adapters.packundo:pre_pack: traversing the object graph to find reachable objects.
    INFO:relstorage.adapters.packundo:pre_pack: marking objects reachable: 4
    INFO:relstorage.adapters.packundo:pre_pack: finished successfully
    INFO:relstorage.storage.pack:pack: pre-pack complete
    INFO:relstorage.adapters.packundo:pack: will remove 3 object(s)
    INFO:relstorage.adapters.packundo:pack: cleaning up
    INFO:relstorage.adapters.packundo:pack: finished successfully
    blob directory: [['blob/.layout'], ['blob/0x00/0x00/0x00/0x00/0x00/0x00/0x00/0x03/0x03da4f169582cd22.blob', 'blob/0x00/0x00/0x00/0x00/0x00/0x00/0x00/0x03/.lock']]
    

    And yes, the blob directory layout changes, so that it can deal with every possible OID, ever. The blob file for OID 6 (the child's blob) has been removed, however.

    The unit tests you found are only run when testing with a shared blob directory:

    # If the blob directory is a cache, don't test packing,
    # since packing can not remove blobs from all caches.
    test_packing = shared_blob_dir