python pandas counter

Python: issue creating a nested dictionary counter


I am studying the correlation between word occurrence and the response variable. To do so I am trying to create a dictionary of dictionaries with the following structure:

{word_1:{response_value:word_1_occurrence_with_same_response_value},
 word_2:{response_value:word_2_occurrence_with_same_response_value}...}

Everything seems to work except the last line of my code.

Here is some example data:

data = pd.DataFrame({
    'message': ['Weather update', 'the Hurricane is over',
                'Checking the weather', 'beautiful weather'],
    'label': [0, 1, 0, 1]
})

and my code:

word_count = {}

for idx,msg in enumerate(data['message']):
    msg = msg.lower()
    label = data['label'][idx]
    for word in msg.split():
        word_count[word]={}
        word_count[word][label]=word_count.get(word,0)+1

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-72-b195c90ef226> in <module>
      6     for word in msg.split():
      7         word_count[word]={}
----> 8         word_count[word][label]=word_count.get(word,0)+1

TypeError: unsupported operand type(s) for +: 'dict' and 'int' 

The output I am trying to obtain is the following:

{'weather': {0: 2}, 'update': {0: 1},'the': {1: 1},'hurricane': {1: 1},
 'is':{1:1},'over':{1:1}, 'checking':{0:1},'the':{0:1},'weather':{1:1},
 'beautiful':{1:1}}

I tried various solutions, but I can't get the counter to work; I only manage to assign values to the keys.
I have also only found posts here about counting from an already existing nested dictionary, whereas my case is the opposite. Please direct me to the appropriate post if I missed it.

Thanks


Solution

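  • Your desired output cannot be obtained in Python, because a dictionary cannot hold two different values for the same key; keys must be unique. The TypeError occurs because word_count.get(word, 0) returns the inner dictionary you assigned on the previous line, not a count, so Python ends up adding a dict and an int. That assignment also resets the word's counts on every occurrence. Here is what I came up with, using setdefault to create the inner dictionary and the per-label count only when they are missing: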

    import pandas as pd
    
    data = pd.DataFrame({
        'message': ['Weather update', 'the Hurricane is over',
                    'Checking the weather', 'beautiful weather'],
        'label': [0, 1, 0, 1]
    })
    
    word_count = {}
    
    for idx,msg in enumerate(data['message']):
        msg = msg.lower()
        label = data['label'][idx]
        for word in msg.split():
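            # The right-hand side runs first: setdefault creates the inner
            # dict and a zero count when missing, then the count is incremented.
            word_count[word][label] = word_count.setdefault(word, {}).setdefault(label, 0)+1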
    
    print(word_count)
    

    Output:

    {'weather': {0: 2, 1: 1}, 'update': {0: 1}, 'the': {1: 1, 0: 1}, 'hurricane': {1: 1}, 'is': {1: 1}, 'over': {1: 1}, 'checking': {0: 1}, 'beautiful': {1: 1}}
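
As an alternative to chaining setdefault, the same per-word, per-label counts could be built with collections.defaultdict and collections.Counter from the standard library. This is only a sketch, not part of the original answer: the lists messages and labels are stand-ins for data['message'] and data['label'] so the snippet runs without pandas, and result is a name introduced here for illustration.

```python
from collections import Counter, defaultdict

# Assumption: plain lists standing in for data['message'] and data['label'],
# holding the same four rows as the DataFrame in the question.
messages = ['Weather update', 'the Hurricane is over',
            'Checking the weather', 'beautiful weather']
labels = [0, 1, 0, 1]

# A missing word gets a fresh Counter, and a missing label inside that
# Counter starts at 0, so no explicit initialisation is needed.
word_count = defaultdict(Counter)

for msg, label in zip(messages, labels):
    for word in msg.lower().split():
        word_count[word][label] += 1

# Convert to plain nested dicts for display.
result = {word: dict(counts) for word, counts in word_count.items()}
print(result)
```

With the DataFrame from the question, the loop header could instead read for msg, label in zip(data['message'], data['label']), which also avoids looking up the label by index.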