Search code examples
pythondefaultdict

Create nested defaultdict in Python


I'm trying to create a nested dictionary in Python so that, given a list of strings, the dictionary records the number of occurrences of that string order.

For example, if the string list is:

["hey", "my", "name", "is"]

I would want the nested dictionary to look like:

{"hey": {"my": {"name": {"is": 1}}}}

I know I could probably use as key the whole list but I specifically want to separate the strings in the dictionary.

I also would want to approach this problem with a defaultdict dictionary, not the Python ones, and preferably use a recursively defined defaultdict.

This is what I tried:

from collections import defaultdict

nested_dict = lambda: defaultdict(nested_dict)
# Initialize ngrams as a nested defaultdict
ngrams = nested_dict()

# Function to update the nested defaultdict with the list of words
def update_ngrams(ngrams, words):
   current_dict = ngrams
   for word in words[:-1]:
       current_dict = current_dict[word]
   current_dict[words[-1]] += 1

# Example usage
update_ngrams(ngrams, ["My", "big", "cat"])
update_ngrams(ngrams, ["My", "big", "dog"])

but it gives me this error:

TypeError: unsupported operand type(s) for +=: 'collections.defaultdict' and 'int'

The expected output should be a map like this:

{"My": {"big": {"cat": 1, "dog": 1}}}

Solution

  • Here is a solution just using the standard dictionary. It creates a new dictionary for every layer except the last one, where it increments an integer.

    def update_ngrams(ng: dict, words: list[str]):
        n = len(words)
        d = ng
        for i, w in enumerate(words,1):
            if i == n:
                d.setdefault(w, 0)
                d[w] += 1
            else:
                d = d.setdefault(w, {})
    
    ngrams = {}
    update_ngrams(ngrams, ["My", "big", "cat"])
    update_ngrams(ngrams, ["My", "big", "dog"])
    ngrams
    # returns:
    {'My': {'big': {'cat': 1, 'dog': 1}}}
    

    The issue you are going to run into is that it does not handle cases where a leaf ('cat' or 'dog') key is also a branch. I.e.: ["My", "big", "dog", "Noodle"].

    My suggestion would be to store the leaf occurrence as a dictionary as well with a specific key to indicate it is the leaf count. Here is an example using the underscore as the leaf counter.

    def update_ngrams(ng: dict, words: list[str]):
        n = len(words)
        d = ng
        for i, w in enumerate(words,1):
            d = d.setdefault(w, {})
            if i == n:
                d.setdefault('_', 0)
                d['_'] += 1
    
    ngrams = {}
    update_ngrams(ngrams, ["My", "big", "cat"])
    update_ngrams(ngrams, ["My", "big", "dog"])
    update_ngrams(ngrams, ["My", "big", "dog", "Noodle"])
    ngrams
    # returns:
    {'My': {'big': {'cat': {'_': 1}, 'dog': {'_': 1, 'Noodle': {'_': 1}}}}}