
Can Cython optimize Python dictionary memory and lookup speed as well?


I have a class which primarily contains three dicts:

from collections import defaultdict

class KB(object):

  def __init__(self):

    # key: str, value: list of str
    self.linear_patterns = defaultdict(list)

    # key: str, value: list of str
    self.nonlinear_patterns = defaultdict(list)

    # key: str, value: dict
    self.pattern_meta_info = {}
    ...
    self.__initialize()

  def __initialize(self):
    # the 3 dicts are populated
    ...

The sizes of the 3 dicts are:

linear_patterns: 100,000 entries
nonlinear_patterns: 900,000 entries
pattern_meta_info: 700,000 entries

After the program has run to completion, it takes about 15 seconds to release the memory. When I reduce the sizes of these dicts by loading less data during initialization, the memory is released faster, so I conclude that the large dicts are what makes the memory release slow. The whole program takes about 8 GB of memory. Also, after the dicts are built, all operations are lookups; there are no modifications.

Is there a way to use Cython to optimize the 3 data structures above, especially in terms of memory usage? Is there a similar Cython dictionary that can replace the Python dicts?


Solution

  • It seems unlikely that a different dictionary or object type would change much. Destructor performance is dominated by the memory allocator. That will be roughly the same unless you switch to a different malloc implementation.

    If this is only about object destruction at the end of your program, most languages (but not Python) would let you simply call exit() while keeping the KB object alive: the OS releases the memory much more quickly when the process terminates, so why bother tearing the dicts down object by object? Unfortunately that doesn't work with Python's sys.exit(), since it merely raises a SystemExit exception and normal interpreter cleanup still runs.
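    A minimal sketch of that behaviour (not part of the original question): sys.exit() only raises SystemExit, which can even be caught, so all the destructors still run on the way out.

    import sys

    try:
        sys.exit(0)
    except SystemExit:
        # sys.exit() merely raised SystemExit; nothing has terminated yet,
        # and interpreter shutdown will still destroy every object.
        print("caught SystemExit")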

    Everything else relies on changing the data structure or algorithm. Are your strings highly redundant? Maybe you can reuse string objects by interning them: they are kept in a shared set so the same string object is used in multiple places. A simple call to string = sys.intern(string) is enough. Unlike in earlier versions of Python, this does not keep the string object alive beyond its last use, so you don't run the risk of leaking memory in a long-running process.
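    A minimal sketch of what that could look like while the dicts are being populated (the helper function name is made up for illustration):

    import sys
    from collections import defaultdict

    linear_patterns = defaultdict(list)

    def add_linear_pattern(key, value):
        # Interning makes identical strings share a single object,
        # which can save a lot of memory for highly redundant data.
        linear_patterns[sys.intern(key)].append(sys.intern(value))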

    You could also pool the strings in one large allocation. If access is relatively rare, you could change the class to use one large io.StringIO object for its contained strings, and have all the dictionaries deal in (offset, length) tuples into that buffer, as sketched below.
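    A rough sketch of that idea, assuming offsets are counted in characters (the StringPool class and its method names are hypothetical, not an existing API):

    import io

    class StringPool:
        def __init__(self):
            self._buf = io.StringIO()

        def add(self, s):
            # Append at the end of the buffer and return a compact reference.
            self._buf.seek(0, io.SEEK_END)
            offset = self._buf.tell()
            self._buf.write(s)
            return (offset, len(s))

        def get(self, ref):
            # Recreate the string on demand from its (offset, length) reference.
            offset, length = ref
            self._buf.seek(offset)
            return self._buf.read(length)

    # The dicts would then map keys to lists of (offset, length) tuples.
    pool = StringPool()
    ref = pool.add("some pattern text")
    assert pool.get(ref) == "some pattern text"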

    That still leaves many tuple and integer objects, but those use specialized allocators that may be faster. Also, small length values will come from CPython's cache of small integers and not allocate a new object.

    A final thought: that is 8 GB of string data. Are you sure you don't want a small sqlite or dbm database? It could live in a temporary file.
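    For example, a minimal sqlite3 sketch of that direction (the file, table, and column names are invented for illustration):

    import sqlite3

    # Store the pattern data on disk (this could be a temporary file)
    # and look values up on demand instead of holding 8 GB of dicts.
    con = sqlite3.connect("patterns.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS linear_patterns (key TEXT, value TEXT)")
    con.execute("CREATE INDEX IF NOT EXISTS idx_linear_key ON linear_patterns (key)")
    con.execute("INSERT INTO linear_patterns VALUES (?, ?)", ("some pattern", "expansion"))
    con.commit()

    values = [row[0] for row in con.execute(
        "SELECT value FROM linear_patterns WHERE key = ?", ("some pattern",))]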