I'm using a klepto archive to index specs of files in a folder tree. After scanning the tree, I want to quickly remove references to deleted files. But simply removing items one by one from the file archive is extremely slow. Is there a way to sync the changes to the archive, or delete multiple keys at once? (The 'sync' method only appears to add new items.)
The helpful answer by @Mike Mckerns to this question only deals with removing a single item: Python Saving and Editing with Klepto
Using files.sync() or files.dump() appears only to append data from the cache, not sync the deletes. Is there a way to delete keys from the cache and then sync those changes all at once? Individual deletes are far too slow.
Here's a working example:
from klepto.archives import *
import os

class PathIndex:
    def __init__(self, folder):
        self.folder_path = folder
        self.files = file_archive(self.folder_path + '/.filespecs', cache=False)
        self.files.load()  # load the memory cache from the archive

    def list_directory(self):
        self.filelist = []
        for folder, subdirs, filelist in os.walk(self.folder_path):  # go through every subfolder in a folder
            for filename in filelist:  # now through every file in the folder/subfolder
                self.filelist.append(os.path.join(folder, filename))

    def scan(self):
        self.list_directory()
        for path in self.filelist:
            self.update_record(path)
        self.files.dump()  # save to the file archive

    def rescan(self):
        self.list_directory()  # rescan the original disk
        deletedfiles = []
        # code to check for modified files etc.
        # check for deleted files
        for path in self.files:
            try:
                self.filelist.remove(path)  # self.filelist holds the disk files, leaving a list of new files
            except ValueError:
                deletedfiles.append(path)
        # code to add new files, i.e. the files left in self.filelist
        for path in deletedfiles:
            self.delete_record(path)
        # looking to sync the modified index to disk here, all at once

    def update_record(self, path):
        self.files[path] = {'size': os.path.getsize(path), 'modified': os.path.getmtime(path)}
        # add other specs - hash of contents etc.

    def delete_record(self, path):
        del self.files[path]  # delete from the memory cache
        # this next line slows it all down
        del self.files.archive[path]  # delete from the disk cache

# usage
_index = PathIndex('/path/to/root')
_index.scan()
# delete, modify some files
_index.rescan()
I see... you really are concerned about the speed of deleting one entry at a time from a file_archive.
Ok, I agree. Using __delitem__ or pop on a file_archive is a bit brutal when you want to delete several entries. The slowdown is due to the file_archive having to load and rewrite the entire file archive for each key you delete. This isn't the case for a dir_archive or many of the other archives... but for a file_archive it is. So that should be remedied...
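If a different archive type is an option, one way around the per-key rewrite cost is the dir_archive mentioned above, which keeps each entry in its own file so a single-key delete doesn't touch the rest of the index. Here is a rough sketch (not from the original answer), using 'filespecs' as a placeholder directory name:

import klepto as kl

# a dir_archive keeps each entry in its own file under the given directory,
# so removing one key does not force a rewrite of every other entry
ar = kl.archives.dir_archive('filespecs', cached=True)
ar['a'] = 1
ar['b'] = 2
ar['c'] = 3
ar.dump()               # write the cache out to the directory archive

del ar['a']             # drop the key from the in-memory cache
del ar.archive['a']     # drop just that one on-disk entry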
UPDATE: I've added a new method that should enable faster dropping of specified keys...
>>> import klepto as kl
>>> ar = kl.archives.file_archive('foo.pkl')
>>> ar['a'] = 1
>>> ar['b'] = 2
>>> ar['c'] = 3
>>> ar['d'] = 4
>>> ar['e'] = 5
>>> ar.dump()
>>> ar.popkeys(list('abx'), None)
[1, 2, None]
>>> ar.sync(clear=True)
>>> ar
file_archive('foo.pkl', {'c': 3, 'e': 5, 'd': 4}, cached=True)
>>> ar.archive
file_archive('foo.pkl', {'c': 3, 'e': 5, 'd': 4}, cached=False)
Previously (i.e. in released versions), you could cheaply pop the keys you want from the local cache, and then do an ar.sync(clear=True) to remove the associated keys in the archive. However, doing that assumes you have all the keys you want to preserve in memory. So, instead of loading all the keys into memory, you can now (at least in the soon-to-be-released version) use popkeys on the cache and/or the archive to delete any unwanted keys from either.
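For the released versions, that workaround might look like the following; a minimal sketch, assuming the whole index fits in the in-memory cache, that sync(clear=True) behaves as described above, and reusing the 'foo.pkl' name from the example:

import klepto as kl

ar = kl.archives.file_archive('foo.pkl')
ar.load()                 # pull the existing archive into the cache

# drop several keys from the in-memory cache only (cheap, no disk I/O yet)
for key in ('a', 'b'):
    ar.pop(key, None)

# rewrite the archive once so it matches the cache (as described above),
# instead of rewriting the whole file once per deleted key
ar.sync(clear=True)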