How to delete multiple items from Klepto file archive fast?


I'm using a klepto file archive to index the specs of files in a folder tree. After scanning the tree, I want to quickly remove references to deleted files, but removing items one by one from the file archive is extremely slow. Is there a way to sync the changes to the archive, or to delete multiple keys at once? (The 'sync' method appears only to add new items.)

The helpful answer by @Mike McKerns to this question only deals with removing a single item: Python Saving and Editing with Klepto

Using files.sync() or files.dump() appears only to append data from the cache, not to sync the deletes. Is there a way to delete keys from the cache and then sync those changes all at once? Individual deletes are far too slow.

Here's a working example:

from klepto.archives import file_archive
import os

class PathIndex:
    def __init__(self,folder):
        self.folder_path=folder
        self.files=file_archive(self.folder_path+'/.filespecs',cached=True) #in-memory cache backed by the file
        self.files.load() #load memory cache

    def list_directory(self):
        self.filelist=[]
        for folder, subdirs, filelist in os.walk(self.folder_path): #go through every subfolder in a folder
            for filename in filelist: #now through every file in the folder/subfolder
                self.filelist.append(os.path.join(folder, filename))

    def scan(self):
        self.list_directory()
        for path in self.filelist:
            self.update_record(path)
        self.files.dump() #save to file archive

    def rescan(self):
        self.list_directory() #rescan original disk
        deletedfiles=[]

        #code to check for modified files etc.
        #check for deleted files
        for path in self.files:
            try:
                self.filelist.remove(path)  #self.filelist - disk files - leaving list of new files
            except ValueError:
                deletedfiles.append(path)

        #code to add new files, the files left in self.filelist
        for path in deletedfiles:
            self.delete_record(path)
        #this is where I'm looking to sync the modified index from memory to disk

    def update_record(self,path):
        self.files[path]={'size':os.path.getsize(path),'modified':os.path.getmtime(path)}
        #add other specs - hash of contents etc.

    def delete_record(self,path):
        del self.files[path] #delete from the memory cache
        #this next line slows it all down
        del self.files.archive[path] #delete from the file archive on disk

#usage
_index=PathIndex('/path/to/root')
_index.scan()
#delete, modify some files
_index.rescan()

Solution

  • I see... you really are concerned about the speed of deleting one entry at a time from a file_archive.

    Ok, I agree. Using __delitem__ or pop on a file_archive is a bit brutal when you want to delete several entries. The slowdown is due to the file_archive having to load and rewrite the entire file archive for each key you delete. This isn't the case for a dir_archive or many of the other archives... but for a file_archive it is. So that should be remedied...
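
The per-key cost described above can be modelled with the standard library alone. This sketch is not klepto's implementation: it assumes the archive is a single pickled dict on disk, and the helper names (`delete_one_by_one`, `delete_in_batch`, `demo`) are made up for illustration. It counts how often the file is rewritten when keys are removed one at a time versus in a single pass.

```python
import os
import pickle
import tempfile


def _read(path):
    """Load the whole pickled dict from disk."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def _write(path, data):
    """Rewrite the whole pickled dict to disk."""
    with open(path, "wb") as fh:
        pickle.dump(data, fh)


def delete_one_by_one(path, keys):
    """Load and rewrite the entire file once per key; return the rewrite count."""
    rewrites = 0
    for key in keys:
        data = _read(path)
        data.pop(key, None)
        _write(path, data)
        rewrites += 1
    return rewrites


def delete_in_batch(path, keys):
    """Load once, drop every key in memory, rewrite once; return the rewrite count."""
    data = _read(path)
    for key in keys:
        data.pop(key, None)
    _write(path, data)
    return 1


def demo(n_items=1000, n_delete=200):
    """Run both strategies on identical data; return {name: (rewrites, items_left)}."""
    records = {f"file{i}": {"size": i} for i in range(n_items)}
    doomed = [f"file{i}" for i in range(n_delete)]
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, func in (("one_by_one", delete_one_by_one),
                           ("batch", delete_in_batch)):
            path = os.path.join(tmp, name + ".pkl")
            _write(path, records)
            rewrites = func(path, doomed)
            results[name] = (rewrites, len(_read(path)))
    return results


if __name__ == "__main__":
    print(demo())
```

Both strategies leave the same entries behind; only the number of full-file rewrites differs, and each rewrite costs time proportional to the size of the whole archive.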

    UPDATE: I've added a new method that should enable faster dropping of specified keys...

    >>> import klepto as kl
    >>> ar = kl.archives.file_archive('foo.pkl')
    >>> ar['a'] = 1
    >>> ar['b'] = 2
    >>> ar['c'] = 3
    >>> ar['d'] = 4
    >>> ar['e'] = 5
    >>> ar.dump()
    >>> ar.popkeys(list('abx'), None)
    [1, 2, None]
    >>> ar.sync(clear=True)
    >>> ar
    file_archive('foo.pkl', {'c': 3, 'e': 5, 'd': 4}, cached=True)
    >>> ar.archive
    file_archive('foo.pkl', {'c': 3, 'e': 5, 'd': 4}, cached=False)
    

    Previously (i.e. in released versions), you could cheaply pop the keys you want from the local cache, and then do an ar.sync(clear=True) to remove the associated keys in the archive. However, doing that assumes you have all the keys you want to preserve in memory. So, instead of loading all the keys into memory, you can now (at least in the soon-to-be-released version) call popkeys on the cache, on the archive, or on both to delete any unwanted keys.
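
Applied to the question's PathIndex, the cache-then-sync approach might look like the sketch below. It assumes `files` is a cached file_archive whose cache already holds every entry that should survive (the question calls self.files.load() in __init__); the function name `drop_deleted` is hypothetical and not part of klepto.

```python
def drop_deleted(files, deleted_paths):
    """Remove many keys from a cached klepto file_archive in one sync.

    Assumption: every entry worth keeping is already in the in-memory
    cache, because sync(clear=True) rebuilds the archive from the cache.
    """
    for path in deleted_paths:
        files.pop(path, None)   # cheap: only touches the in-memory cache
    files.sync(clear=True)      # push the trimmed cache to the archive in one go
    return len(files)           # number of entries that remain


# Hypothetical use at the end of PathIndex.rescan(), replacing the
# per-path delete_record() loop:
#     drop_deleted(self.files, deletedfiles)
```

The archive file is then rewritten during the sync rather than once per deleted key. If the cache does not hold the full index, the popkeys method shown above, called on files.archive with the list of deleted paths, avoids loading everything into memory, provided the installed klepto version includes it.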