
Maintaining a large list in Python


I need to maintain a large list of picklable Python objects. The list is too large to fit in RAM, so some database or paging mechanism is required. The mechanism must support fast access to nearby regions of the list.

The list should implement all the Python list features, but most of the time I will work sequentially: scan some range of the list and, while scanning, decide whether to insert or pop nodes at the scanning point.
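
To make that access pattern concrete, here is a minimal sketch on a plain in-memory list. The name scan_and_edit and the decide callback are hypothetical, made up for illustration only; whatever ends up backing the list would need to make these nearby index, insert and pop calls cheap.

```python
def scan_and_edit(seq, start, stop, decide):
    """Scan seq[start:stop], editing in place.

    decide(item) returns 'keep', 'pop' or ('insert', value).
    (Hypothetical helper, for illustration only.)
    """
    index = start
    while index < stop and index < len(seq):
        action = decide(seq[index])
        if action == 'pop':
            seq.pop(index)            # the next node slides into this index
            stop -= 1                 # the scanned range shrank by one
        elif action == 'keep':
            index += 1
        else:
            _, value = action
            seq.insert(index, value)  # new node goes just before the current one
            index += 2                # skip the new node and the one just examined
            stop += 1                 # the scanned range grew by one
    return seq


def example_decide(node):
    # Drop odd numbers, put a marker before multiples of 4, keep the rest.
    if node % 2:
        return 'pop'
    if node % 4 == 0:
        return ('insert', 'mark')
    return 'keep'


result = scan_and_edit(list(range(10)), 0, 10, example_decide)
```

The point of the sketch is that every operation happens at or next to the current scan index, so only a small neighbourhood of the list needs to be in RAM at any time.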

The list can be very large (2-3 GB) and should not be held entirely in RAM at once. The nodes are small (100-200 bytes) but can contain various types of data.

A good solution could be a BTree, where only the most recently accessed buckets are loaded in RAM.
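
To illustrate the bucket idea, here is a toy in-memory sketch. It is not how zc.blist or a real BTree is implemented: BucketList is a made-up name, buckets are located by a linear walk rather than through a tree, negative indexes are not supported, and nothing is paged to disk. It only shows that an insert or pop touches a single small bucket.

```python
class BucketList(object):
    """Toy list split into small buckets (illustration only)."""

    def __init__(self, items=(), bucket_size=4):
        self.bucket_size = bucket_size
        items = list(items)
        self.buckets = [items[i:i + bucket_size]
                        for i in range(0, len(items), bucket_size)] or [[]]

    def __len__(self):
        return sum(len(bucket) for bucket in self.buckets)

    def __iter__(self):
        for bucket in self.buckets:
            for item in bucket:
                yield item

    def _locate(self, index):
        # Walk the bucket sizes to find (bucket number, offset inside it).
        if index < 0:
            raise IndexError('negative indexes are not supported in this sketch')
        for number, bucket in enumerate(self.buckets):
            if index < len(bucket):
                return number, index
            index -= len(bucket)
        # At or past the end: point just after the last element.
        return len(self.buckets) - 1, len(self.buckets[-1])

    def __getitem__(self, index):
        number, offset = self._locate(index)
        return self.buckets[number][offset]

    def insert(self, index, value):
        number, offset = self._locate(index)
        bucket = self.buckets[number]
        bucket.insert(offset, value)
        if len(bucket) > self.bucket_size:
            # Split the overfull bucket in two; the other buckets are untouched.
            half = len(bucket) // 2
            self.buckets[number:number + 1] = [bucket[:half], bucket[half:]]

    def pop(self, index):
        number, offset = self._locate(index)
        value = self.buckets[number].pop(offset)
        if not self.buckets[number] and len(self.buckets) > 1:
            del self.buckets[number]  # drop a bucket that became empty
        return value
```

In a persistent version, each bucket would be a separately stored object, so only the buckets near the scan point would have to be loaded.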

SQL tables are not a good fit, since I would need to implement a complex index-key mechanism. My data is not a table; it's a simple Python list, with elements added at specific indexes and popped from specific positions.
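
To show what I mean, here is a rough sketch of the naive version using the standard sqlite3 module. The table name, column names and helper functions are all hypothetical. Every insert or pop has to renumber all later rows to keep the positions contiguous, which is the bookkeeping I would like to avoid.

```python
import pickle
import sqlite3


def open_store(path=':memory:'):
    # Hypothetical schema: one row per node, ordered by an integer position.
    conn = sqlite3.connect(path)
    conn.execute('CREATE TABLE nodes (pos INTEGER NOT NULL, data BLOB NOT NULL)')
    conn.execute('CREATE INDEX nodes_pos ON nodes (pos)')
    return conn


def sql_insert(conn, index, obj):
    # Make room: every row at or after index is shifted up by one.
    conn.execute('UPDATE nodes SET pos = pos + 1 WHERE pos >= ?', (index,))
    conn.execute('INSERT INTO nodes (pos, data) VALUES (?, ?)',
                 (index, sqlite3.Binary(pickle.dumps(obj))))


def sql_pop(conn, index):
    row = conn.execute('SELECT data FROM nodes WHERE pos = ?',
                       (index,)).fetchone()
    if row is None:
        raise IndexError(index)
    conn.execute('DELETE FROM nodes WHERE pos = ?', (index,))
    # Close the gap: every later row is shifted down by one.
    conn.execute('UPDATE nodes SET pos = pos - 1 WHERE pos > ?', (index,))
    return pickle.loads(bytes(row[0]))


def sql_to_list(conn):
    rows = conn.execute('SELECT data FROM nodes ORDER BY pos')
    return [pickle.loads(bytes(row[0])) for row in rows]
```

With millions of rows, each of those UPDATE statements touches on average half the table, so a smarter key scheme would be needed to make this usable.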

I tried ZODB and zc.blist, which implements a BTree-based list that can be stored in a ZODB database file, but I don't know how to configure it so that the above operations run in reasonable time. I don't need the multi-threading or transaction features. Nothing except my single-threaded program will touch the database file.

Can anyone explain how to configure ZODB and zc.blist so that the above operations run fast, or show me a different large-list implementation?

Some quick-and-dirty code that I tried, using a plain in-memory list:

import time
import random

NODE_JUMP = 50000    # nodes appended per iteration
NODE_ACCESS = 10000  # nodes modified per iteration

print('STARTING')


random_bytes = open('/dev/urandom', 'rb')

my_list = list()

nodes_no = 0

# Grow the list forever, timing each extension and each sequential update
while True:
    nodes_no += NODE_JUMP
    start = time.time()
    my_list.extend(random_bytes.read(100) for i in range(NODE_JUMP))
    print('extending to %s nodes took %.2f seconds' % (nodes_no, time.time() - start))

    # pick a random contiguous section and modify every node in it
    section_start = random.randint(0, nodes_no - NODE_ACCESS - 1)
    start = time.time()
    for index in range(section_start, section_start + NODE_ACCESS):
        # rotate the string; slicing with [:1] also works for bytes
        my_list[index] = my_list[index][1:] + my_list[index][:1]

    print('access to %s nodes took %.2f seconds' % (NODE_ACCESS, time.time() - start))

The output ended with:

extending to 5000000 nodes took 3.49 seconds
access to 10000 nodes took 0.02 seconds
extending to 5050000 nodes took 3.98 seconds
access to 10000 nodes took 0.01 seconds
extending to 5100000 nodes took 2.54 seconds
access to 10000 nodes took 0.01 seconds
extending to 5150000 nodes took 2.19 seconds
access to 10000 nodes took 0.11 seconds
extending to 5200000 nodes took 2.49 seconds
access to 10000 nodes took 0.01 seconds
extending to 5250000 nodes took 3.13 seconds
access to 10000 nodes took 0.05 seconds
Killed (not by me)

Solution

  • zc.blist can give good results after all. The "cache_size" option, set when creating the DB, controls how much data stays in RAM; it is a target number of objects per connection cache rather than a byte count. RAM usage can still grow beyond that if you don't call "transaction.commit" often enough, since modified objects cannot be dropped from the cache until they are committed. With a large cache_size and frequent transaction.commit calls, the most recently accessed buckets of the blist stay in RAM, giving you fast access to them, and the amount of RAM used does not grow too much.

    Packing is very expensive, but if you have a large hard disk you don't need to do it often.

    Here is some code to try yourself. Run "top" in the background and change cache_size to see how it affects the amount of RAM used.

    import time
    import os
    import glob
    from ZODB import DB
    from ZODB.FileStorage import FileStorage
    import transaction
    from zc.blist import BList
    
    print('STARTING')
    
    random = open('/dev/urandom', 'rb')
    
    
    def test_list(my_list, loops=1000, element_size=100):
        """Time basic list operations on my_list, then commit and pack."""
        print('testing list')
        start = time.time()
        for loop in range(loops):
            my_list.append(random.read(element_size))
        print('appending %s elements took %.4f seconds' % (loops, time.time() - start))

        start = time.time()
        length = len(my_list)
        print('length calculated in %.4f seconds' % (time.time() - start,))

        start = time.time()
        for loop in range(loops):
            # floor division keeps the index an integer
            my_list.insert(length // 2, random.read(element_size))
        print('inserting %s elements took %.4f seconds' % (loops, time.time() - start))

        start = time.time()
        for loop in range(loops):
            # rotate the string; slicing with [:1] also works for bytes
            my_list[loop] = my_list[loop][1:] + my_list[loop][:1]
        print('modifying %s elements took %.4f seconds' % (loops, time.time() - start))

        start = time.time()
        for loop in range(loops):
            del my_list[0]
        print('removing %s elements took %.4f seconds' % (loops, time.time() - start))

        start = time.time()
        transaction.commit()
        print('committing all above took %.4f seconds' % (time.time() - start,))

        # remove another batch in one slice, so that packing has something to reclaim
        del my_list[:loops]
        transaction.commit()

        start = time.time()
        pack()
        print('packing after removing %s elements took %.4f seconds' % (loops, time.time() - start))
    
    for filename in glob.glob('database.db*'):    
        try:
            os.unlink(filename)
        except OSError:
            pass
    
    # cache_size is the target number of objects kept in the connection cache
    db = DB(FileStorage('database.db'),
            cache_size=2000)

    def pack():
        db.pack()

    root = db.open().root()

    root['my_list'] = BList()

    print('inserting initial data to blist')

    # load 1,000,000 nodes, committing after every 100,000
    for loop in range(10):
        root['my_list'].extend(random.read(100) for x in range(100000))
        transaction.commit()
    
    test_list(root['my_list'])
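
The "commit often" advice can be wrapped in a small helper, sketched below. The name extend_in_batches and its parameters are hypothetical, not part of ZODB or zc.blist. The helper only assumes that the target has an extend method and that commit is a callable taking no arguments, so it can be tried on a plain list first.

```python
def extend_in_batches(target, items, commit, batch_size=10000):
    """Append items to target, calling commit() after every batch_size items.

    Returns the number of commits made. (Hypothetical helper.)
    """
    batch = []
    commits = 0
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            target.extend(batch)
            commit()
            commits += 1
            batch = []
    if batch:
        # flush the final, partially filled batch
        target.extend(batch)
        commit()
        commits += 1
    return commits


# Trial run on a plain list, with a counter standing in for the real commit
calls = []
plain = []
made = extend_in_batches(plain, range(25), lambda: calls.append(1), batch_size=10)
```

With the script above, the call would presumably look like extend_in_batches(root['my_list'], source, transaction.commit), where source is any iterable of nodes. A smaller batch_size keeps fewer uncommitted objects in RAM at the cost of more commits.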