Search code examples
pythondictionarydefaultdict

defaultdict vs dict element initialization


I am trying to optimize the performance of a script that looks up similar words in a lexicon for each word given.

Each unique word is to be split into letter n-grams and for each n-gram, the lexicon returns a list of words that contain the same letter n-gram. Each word from this list is then added to a dictionary as a key and it's value is incremented by one. This gives me a dictionary of similar words with corresponding frequency scores.

word_dict = {}
get = word_dict.get
for letter_n_gram in word:
    for entry in lexicon[n_gram]:
        word_dict[entry] = get(entry, 0) + 1

This implementation works, but the script could supposedly run faster by switching the dict for collections.defaultdict.

word_dd = defaultdict(int)
for letter_n_gram in word:
    for entry in lexicon[n_gram]:
        word_dd[entry] += 1

No other code has been changed.

I was under the impression that both code snippets (most importantly the score adding) should work in the exact same way, i.e. if the key exists, increase its value by 1, if it does not exist, create the key and set the value to 1.

After running the new code, however, some of the keys had values of 0, which I find logically impossible.

Is my logic or knowledge of defaultdict functionality flawed? If not, how can any value in word_dd be set to 0?

edit: I am also very sure that no other part of the script skews these results, as I test the dictionary immediately after shown code by using:

for item in word_dd.iteritems():
    if item[1] == 0:
        print "Found zero value element"
        break

Solution

  • Alas, I have found the fault in my code.

    As there are many consequent word n-grams with the same tested word in my input set, I am only creating the dictionary of similar words once per unique tested word.

    This dictionary is then used for other purposes with keys being tested multiple times. This, of course, can create zero-valued elements, if the dictionary is collections.defaultdict and the default factory is not set to None.

    Testing for the zero-valued elements was, however, done in each main loop - therefore finding zero-valued elements created in the previous loop.

    After indenting the testing code into the proper part, the results are as expected - no zero-valued elements immediately after creation.

    I would like to apologize to everyone for the faulty and incomplete construction of my question - it was impossible for anyone else to find the error.