Tags: python, bioinformatics, dictionary, defaultdict

Joining large dictionaries by identical keys


I have around 10 huge files that contain Python dictionaries like so:

    dict1:
    {   
        'PRO-HIS-MET': {
            'A': ([1,2,3],[4,5,6],[7,8,9]),
            'B': ([5,2],[6],[8,9]),
            'C': ([3],[4],[7,8])},
        'TRP-MET-GLN': {
            'F': ([-5,-4,1123],[-7,-11,2],[-636,-405])}
    }

    dict2:
    {   
        'PRO-HIS-MET': {
            'J': ([-657], [7,-20,3], [-8,-85,15])},

        'TRP-MET-GLN': {
            'K': ([1,2,3],[4,50,6],[7,80,9]), 
            'L': ([5,20],[60,80],[8,9])}
    }

Basically they are all dictionaries of dictionaries. Each file is around 1 GB in size (the above is just an example of the data). Anyway, what I would like to do is join the 10 dictionaries together:

    final:
    {
        'PRO-HIS-MET': {
            'A': ([1,2,3],[4,5,6],[7,8,9]),
            'B': ([5,2],[6],[8,9]),
            'C': ([3],[4],[7,8]),
            'J': ([-657], [7,-20,3], [-8,-85,15])},
        'TRP-MET-GLN': {
            'F': ([-5,-4,1123],[-7,-11,2],[-636,-405]),
            'K': ([1,2,3],[4,50,6],[7,80,9]), 
            'L': ([5,20],[60,80],[8,9])}
    }

I have tried the following code on small files and it works fine:

    import csv
    import collections
    d1 = {}
    d2 = {}
    final = collections.defaultdict(dict)

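    # read each file and rebuild its dictionary in memory (eval parses the stored value)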
    for key, val in csv.reader(open('filehere.txt')):
        d1[key] = eval(val)
    for key, val in csv.reader(open('filehere2.txt')):
        d2[key] = eval(val)

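    # merge the inner dictionaries under each shared outer key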
    for key in d1:
        final[key].update(d1[key])
    for key in d2:
        final[key].update(d2[key])

    out = csv.writer(open('out.txt', 'w'))
    for k, v in final.items():
        out.writerow([k, v])

However, if I try that on my 1 GB files I quickly run out of memory, because d1 and d2 as well as the final dictionary are all held in memory at once.

I have a couple ideas:

  1. Is there a way to load just the keys from each dictionary, compare them, and, if the same keys are found in multiple dictionaries, combine their values?
  2. Instead of merging the dictionaries into one huge file (which will probably give me memory headaches in the future), how can I write many separate files, each containing the merged values for a single key? For example, for the above data I would just have:

    pro-his-met.txt:
    'PRO-HIS-MET': {
        'A': ([1,2,3],[4,5,6],[7,8,9]),
        'B': ([5,2],[6],[8,9]),
        'C': ([3],[4],[7,8]),
        'J': ([-657], [7,-20,3], [-8,-85,15])}
    trp-met-gln.txt:
    'TRP-MET-GLN': {
        'F': ([-5,-4,1123],[-7,-11,2],[-636,-405]),
        'K': ([1,2,3],[4,50,6],[7,80,9]), 
        'L': ([5,20],[60,80],[8,9])}
    

I don't have much programming experience as a biologist (you may have guessed that the above data represents a bioinformatics problem), so any help would be much appreciated!


Solution

  • The shelve module is a very easy-to-use database for Python. It's nowhere near as powerful as a real database (for that, see @Voo's answer), but it will do the trick for manipulating large dictionaries.

    First, create shelves from your dictionaries:

    import shelve
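    # flag='n' creates a new, empty shelf; protocol=-1 uses the newest pickle protocol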
    s = shelve.open('filehere.db', flag='n', protocol=-1, writeback=False)
    for key, val in csv.reader(open('filehere.txt')):
        s[key] = eval(val)
    s.close()
    

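    If all 10 inputs follow the same pattern, that conversion step can be wrapped in a loop. The file names below are placeholders, so substitute your real ones:

    import csv
    import shelve

    input_files = ['filehere.txt', 'filehere2.txt']  # ...and so on for all 10 files
    for i, fname in enumerate(input_files):
        # one shelf per input file, named file1.db, file2.db, ...
        s = shelve.open('file%d.db' % (i + 1), flag='n', protocol=-1, writeback=False)
        for key, val in csv.reader(open(fname)):
            s[key] = eval(val)  # same parsing as your original code
        s.close()
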
    Now that you've shelved everything neatly, you can operate on the dictionaries efficiently:

    import shelve
    import itertools
    s = shelve.open('final.db', flag='c', protocol=-1, writeback=False)
    s1 = shelve.open('file1.db', flag='r')
    s2 = shelve.open('file2.db', flag='r')
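    # iterate lazily over both shelves so only one entry is loaded at a time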
    for key, val in itertools.chain(s1.iteritems(), s2.iteritems()):
        d = s.get(key, {})
        d.update(val)
        s[key] = d # force write
    s.close()
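
    For your second idea (one output file per key), you can walk the merged shelf in the same lazy way and dump each entry to its own file, so only one key's data is ever in memory. This is just a sketch; the lower-cased file names and repr-style output format are assumptions:

    import shelve

    s = shelve.open('final.db', flag='r')
    for key, val in s.iteritems():
        # e.g. 'PRO-HIS-MET' -> pro-his-met.txt (assumed naming scheme)
        out = open(key.lower() + '.txt', 'w')
        out.write('%r: %r\n' % (key, val))
        out.close()
    s.close()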