Tags: python, dictionary, defaultdict

Join two defaultdicts in Python


I parsed a huge database of bibliographic records (about 20 million records). Each record has a unique ID field, a set of authors, and a set of terms/keywords that describe the main content of the record. For example, a typical bibliographic record looks like:

ID: 001
Author: author1
Author: author2
Term: term1
Term: term2

First, I create two defaultdicts to store authors and terms:

from collections import defaultdict

d1 = defaultdict(lambda: defaultdict(list))
d2 = defaultdict(lambda: defaultdict(list))

Next, I populate authors:

d1['id001'] = ['author1', 'author2'] 
d1['id002'] = ['author3'] 
d1['id003'] = ['author1', 'author4'] 

and keywords:

d2['id001'] = ['term1', 'term2']  
d2['id002'] = ['term2', 'term3']
d2['id003'] = ['term4']

The problem is how to join these two dictionaries to obtain a data object that links authors directly to terms:

author1|term1,term2,term4
author2|term1,term2
author3|term2,term3
author4|term4

I have two questions:

  • Is the proposed approach appropriate, or should I store/represent the data in some other way?
  • Could you please roughly suggest how to join both dictionaries?

Solution

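  • Here is one way. Because every key is assigned a complete list directly, the default factory is never triggered, so plain dictionaries are enough for the initial step and a nested defaultdict is not needed. A defaultdict(list) is only useful for the result, where terms are accumulated per author.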

    from collections import defaultdict
    
    d1 = {}
    d2 = {}
    
    d1['id001'] = ['author1', 'author2'] 
    d1['id002'] = ['author3'] 
    d1['id003'] = ['author1', 'author4'] 
    
    d2['id001'] = ['term1', 'term2']  
    d2['id002'] = ['term2', 'term3']
    d2['id003'] = ['term4']
    
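    # Map each author to the terms of every record they appear in.
    res = defaultdict(list)
    
    # Only consider record IDs present in both dictionaries.
    for record_id in set(d1) & set(d2):
        for author in d1[record_id]:
            res[author].extend(d2[record_id])
    
    # Sort each author's terms and convert back to a plain dict.
    res = {k: sorted(v) for k, v in res.items()}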
    
    # {'author1': ['term1', 'term2', 'term4'],
    #  'author2': ['term1', 'term2'],
    #  'author3': ['term2', 'term3'],
    #  'author4': ['term4']}
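
  • A possible variant, offered as a sketch rather than a definitive implementation. With extend, a term that occurs in two records of the same author would be listed twice, so this version collects terms in sets instead. It also prints the author|term,term format shown in the question. The names authors_by_id, terms_by_id, terms_by_author and lines are illustrative and not from the original post.

```python
from collections import defaultdict

# Same sample data as above, as plain dicts keyed by record ID.
authors_by_id = {
    'id001': ['author1', 'author2'],
    'id002': ['author3'],
    'id003': ['author1', 'author4'],
}
terms_by_id = {
    'id001': ['term1', 'term2'],
    'id002': ['term2', 'term3'],
    'id003': ['term4'],
}

# A set per author drops terms repeated across that author's records.
terms_by_author = defaultdict(set)

# dict.keys() views support set intersection directly.
for record_id in authors_by_id.keys() & terms_by_id.keys():
    for author in authors_by_id[record_id]:
        terms_by_author[author].update(terms_by_id[record_id])

# One "author|term,term,..." line per author, sorted for stable output.
lines = [
    f"{author}|{','.join(sorted(terms))}"
    for author, terms in sorted(terms_by_author.items())
]
print('\n'.join(lines))
```

    For the sample data this prints the four lines requested in the question. Whether sets or lists are the better fit depends on whether repeated terms carry meaning, for example as a frequency count.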