I have around 10 huge files that contain Python dictionaries, like so:
dict1:
{
'PRO-HIS-MET': {
'A': ([1,2,3],[4,5,6],[7,8,9]),
'B': ([5,2],[6],[8,9]),
'C': ([3],[4],[7,8])},
'TRP-MET-GLN': {
'F': ([-5,-4,1123],[-7,-11,2],[-636,-405])}
}
dict2:
{
'PRO-HIS-MET': {
'J': ([-657], [7,-20,3], [-8,-85,15])},
'TRP-MET-GLN': {
'K': ([1,2,3],[4,50,6],[7,80,9]),
'L': ([5,20],[60,80],[8,9])}
}
Basically they are all dictionaries of dictionaries. Each file is around 1 GB in size (the above is just an example of the data). Anyway, what I would like to do is join the 10 dictionaries together:
final:
{
'PRO-HIS-MET': {
'A': ([1,2,3],[4,5,6],[7,8,9]),
'B': ([5,2],[6],[8,9]),
'C': ([3],[4],[7,8]),
'J': ([-657], [7,-20,3], [-8,-85,15])},
'TRP-MET-GLN': {
'F': ([-5,-4,1123],[-7,-11,2],[-636,-405]),
'K': ([1,2,3],[4,50,6],[7,80,9]),
'L': ([5,20],[60,80],[8,9])}
}
I have tried the following code on small files and it works fine:
import csv
import collections

d1 = {}
d2 = {}
final = collections.defaultdict(dict)

for key, val in csv.reader(open('filehere.txt')):
    d1[key] = eval(val)
for key, val in csv.reader(open('filehere2.txt')):
    d2[key] = eval(val)

for key in d1:
    final[key].update(d1[key])
for key in d2:
    final[key].update(d2[key])

out = csv.writer(open('out.txt', 'w'))
for k, v in final.items():
    out.writerow([k, v])
However, if I try that on my 1 GB files I quickly run out of memory, because d1 and d2 as well as the final dictionary are all held in memory at once.
I have a couple of ideas:
Instead of merging the dictionaries into one huge file (which will probably give me memory headaches in the future), how can I write many separate files, each containing all of the merged values for a single key? For example, for the above data I would just have:
pro-his-met.txt:
'PRO-HIS-MET': {
'A': ([1,2,3],[4,5,6],[7,8,9]),
'B': ([5,2],[6],[8,9]),
'C': ([3],[4],[7,8]),
'J': ([-657], [7,-20,3], [-8,-85,15])}
trp-met-gln.txt:
'TRP-MET-GLN': {
'F': ([-5,-4,1123],[-7,-11,2],[-636,-405]),
'K': ([1,2,3],[4,50,6],[7,80,9]),
'L': ([5,20],[60,80],[8,9])}
As a biologist I don't have much programming experience (you may have guessed that the above data represents a bioinformatics problem), so any help would be much appreciated!
The shelve module is a very easy-to-use database for Python. It's nowhere near as powerful as a real database (for that, see @Voo's answer), but it will do the trick for manipulating large dictionaries.
First, create shelves from your dictionaries:
import csv
import shelve

s = shelve.open('filehere.db', flag='n', protocol=-1, writeback=False)
for key, val in csv.reader(open('filehere.txt')):
    s[key] = eval(val)
s.close()
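You would repeat that once per input file so each of your ten text files gets its own small shelf. A minimal sketch, assuming (hypothetically) that your inputs are named file1.txt through file10.txt:

import csv
import shelve

# Hypothetical filenames -- substitute the names of your ten input files.
for i in range(1, 11):
    txt_name = 'file%d.txt' % i
    db_name = 'file%d.db' % i
    s = shelve.open(db_name, flag='n', protocol=-1, writeback=False)
    for key, val in csv.reader(open(txt_name)):
        s[key] = eval(val)
    s.close()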
Now that you've shelved everything neatly, you can operate on the dictionaries efficiently:
import itertools
import shelve

s = shelve.open('final.db', flag='c', protocol=-1, writeback=False)
s1 = shelve.open('file1.db', flag='r')
s2 = shelve.open('file2.db', flag='r')

# .items() on a Shelf iterates lazily in Python 3 (use .iteritems() on Python 2)
for key, val in itertools.chain(s1.items(), s2.items()):
    d = s.get(key, {})
    d.update(val)
    s[key] = d  # reassign to force the merged dict to be written back to disk
s1.close()
s2.close()
s.close()
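If you would rather end up with one file per key (the pro-his-met.txt layout from your question) instead of a single merged output, you can walk the final shelf and write each entry out separately. This is only a minimal sketch, under the assumption that lowercasing the key gives an acceptable filename:

import shelve

final = shelve.open('final.db', flag='r')
for key in final:
    # e.g. 'PRO-HIS-MET' -> pro-his-met.txt
    with open(key.lower() + '.txt', 'w') as out:
        out.write('%r: %r\n' % (key, final[key]))
final.close()

Because only one key's value is loaded at a time, this stays well within memory no matter how large the merged shelf grows.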