I parsed a huge database of bibliographic records (about 20 million records). Each record has a unique ID field, a set of authors, and a set of terms/keywords that describe the main content of the record. For example, a typical bibliographic record looks like:
ID: 001
Author: author1
Author: author2
Term: term1
Term: term2
First, I create two defaultdicts to store authors and terms:
from collections import defaultdict
d1 = defaultdict(lambda: defaultdict(list))
d2 = defaultdict(lambda: defaultdict(list))
Next, I populate authors:
d1['id001'] = ['author1', 'author2']
d1['id002'] = ['author3']
d1['id003'] = ['author1', 'author4']
and keywords:
d2['id001'] = ['term1', 'term2']
d2['id002'] = ['term2', 'term3']
d2['id003'] = ['term4']
The problem is how to join these two dictionaries to obtain a data object that links authors to terms directly:
author1|term1,term2,term4
author2|term1,term2
author3|term2,term3
author4|term4
I have two questions:
This is one way. Note that, as demonstrated below, you do not need nested dictionaries or a defaultdict for your initial step.
from collections import defaultdict
d1 = {}
d2 = {}
d1['id001'] = ['author1', 'author2']
d1['id002'] = ['author3']
d1['id003'] = ['author1', 'author4']
d2['id001'] = ['term1', 'term2']
d2['id002'] = ['term2', 'term3']
d2['id003'] = ['term4']
res = defaultdict(list)
# Only IDs present in both dictionaries can link an author to a term.
for ids in set(d1) & set(d2):
    for v in d1[ids]:
        res[v].extend(d2[ids])
res = {k: sorted(v) for k, v in res.items()}
# {'author1': ['term1', 'term2', 'term4'],
# 'author2': ['term1', 'term2'],
# 'author3': ['term2', 'term3'],
# 'author4': ['term4']}
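One caveat with the list-based version: if the same author appears on several records that share a term, `extend` will add that term more than once. A possible variant, sketched here under the assumption that you want each term listed once per author, collects terms in sets and then builds the `author|term1,term2` lines from the question. The names `author_terms`, `rec_id` and `lines` are only illustrative.

```python
from collections import defaultdict

# Same sample data as above.
d1 = {'id001': ['author1', 'author2'],
      'id002': ['author3'],
      'id003': ['author1', 'author4']}
d2 = {'id001': ['term1', 'term2'],
      'id002': ['term2', 'term3'],
      'id003': ['term4']}

# Map each author to the set of terms from all of their records.
author_terms = defaultdict(set)
for rec_id, authors in d1.items():
    if rec_id not in d2:
        continue  # record has no terms, so there is nothing to link
    for author in authors:
        author_terms[author].update(d2[rec_id])

# Build one "author|term,term,..." line per author, sorted for stable output.
lines = [f"{author}|{','.join(sorted(terms))}"
         for author, terms in sorted(author_terms.items())]
print("\n".join(lines))
# author1|term1,term2,term4
# author2|term1,term2
# author3|term2,term3
# author4|term4
```

Iterating over `d1.items()` directly also avoids building two ID sets up front, which may matter with around 20 million records, though I have not measured it on data of that size.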