
How to store dictionary inside defaultdict(list)


import pandas as pd
import re
from collections import defaultdict

d = defaultdict(list)
df = pd.read_csv('https://raw.githubusercontent.com/twittergithub/hello/main/category_app_id_text_1_month_march_2021%20(1).csv')

The dataframe looks like this:

                                             suggestions                                           category
0      ['jio tv', 'jio', 'jiosaavn', 'jiomart', 'jio ...  ['BOOKS_AND_REFERENCE', 'PRODUCTIVITY', 'MUSIC...
1      ['instagram', 'internet', 'instacart', 'instag...  ['SOCIAL', 'COMMUNICATION', 'FOOD_AND_DRINK', ...
2      ['instagram', 'instacart', 'instagram download...  ['SOCIAL', 'FOOD_AND_DRINK', 'VIDEO_PLAYERS', ...
3      ['vpn', 'vpn free', 'vpn master', 'vpn private...  ['TOOLS', 'TOOLS', 'TOOLS', 'TOOLS', 'TOOLS', ...
4      ['pubg', 'pubg mobile lite', 'pubg lite', 'pub...  ['GAME_ACTION', 'GAME_ACTION', 'TOOLS', 'GAME_...
...                                                  ...                                                ...
49610  ['inbuilt camera app', 'inbuilt screen recorde...  ['PHOTOGRAPHY', 'VIDEO_PLAYERS', 'TOOLS', 'PRO...
49611  ['mpsc science app in marathi', 'mpsc science ...  ['EDUCATION', 'EDUCATION', 'EDUCATION', 'EDUCA...
49612  ['ryerson', 'ryerson university', 'ryerson mob...  ['BOOKS_AND_REFERENCE', 'EDUCATION', 'EDUCATIO...
49613  ['eeze', 'eezee english', 'ezee tab', 'deezer'...  ['TRAVEL_AND_LOCAL', 'EDUCATION', 'BUSINESS', ...
49614  ['hindi love story books free download', 'hind...  ['BOOKS_AND_REFERENCE', 'BOOKS_AND_REFERENCE',...

I want to create a dictionary keyed by every category that appears in the category list of each row. Inside each category there should be another dictionary holding the suggestions from the suggestions column of the same row. If a suggestion or category repeats, its counter inside the dictionary should simply be incremented.

dictionary = defaultdict(list)
for i in range(df.shape[0]):
    categories = set(re.sub(r'[^\w\s]', '', df.loc[i, 'category']).split())
    for category in categories:
        suggestions = set(re.sub(r'[^\w\s]', '', df.loc[i, 'suggestions']).split())
        for suggestion in suggestions:
            if suggestion not in dictionary[category]:
                dictionary[category][suggestion] = 1
            else:
                dictionary[category][suggestion] += 1

But I am getting an empty list for each category inside the defaultdict. I hope my question is clear.


Solution

  • It's probably a bit easier and faster to do with pandas:

    from ast import literal_eval
    
    # create cartesian product of categories and suggestions for each record,
    # and calculate value_counts
    z = pd.merge(
        df['category'].apply(literal_eval).explode(),
        df['suggestions'].apply(literal_eval).explode(),
        left_index=True,
        right_index=True).value_counts()
    
    # convert to nested dict
    d = {l: z.xs(l).to_dict() for l in z.index.levels[0]}
    d
    

    Output:

    {'ART_AND_DESIGN': {'flipaclip': 39,
      'mehndi design': 28,
      'ibis paint x': 22,
      'u launcher lite': 21,
      'poster maker': 20,
      'poster maker design app free': 20,
      'ibis paint': 18,
      'mehndi design 2021': 18,
      'mehandi ka design': 18,
      'u launcher': 18,
    ...
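To make the merge step easier to follow, here is a minimal sketch of the same pipeline on a tiny hand-made frame. The two sample rows and the names `categories`, `suggestions`, `pairs`, `counts` and `nested` are made up for illustration and are not taken from the real CSV; the only assumption carried over is that each cell is a string containing a Python list.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical sample: each cell is a string that looks like a Python list,
# which is how the columns appear in the question's dataframe.
df = pd.DataFrame({
    "suggestions": ["['jio tv', 'jio']", "['jio', 'vpn']"],
    "category": ["['MUSIC', 'TOOLS']", "['TOOLS', 'TOOLS']"],
})

# Parse the strings into real lists, then put one list item on each row.
# The original row index is kept, so items from the same record share an index.
categories = df["category"].apply(literal_eval).explode()
suggestions = df["suggestions"].apply(literal_eval).explode()

# Joining on the shared index pairs every category of a record
# with every suggestion of the same record.
pairs = pd.merge(categories, suggestions, left_index=True, right_index=True)

# Count how often each (category, suggestion) pair occurs.
counts = pairs.value_counts()

# Turn the two-level index into a nested dict: category -> {suggestion: count}.
nested = {cat: counts.xs(cat).to_dict() for cat in counts.index.levels[0]}
print(nested)
```

Note that a category listed twice in the same row (as `'TOOLS'` is in the second sample row) contributes twice to the counts here, whereas the loop in the question deduplicates each row with `set(...)` first.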
    

    Having said this, if you want to go with the original approach, all you need to fix is to declare the dictionary as defaultdict(dict) instead of defaultdict(list), because you're making a nested dictionary, not a dictionary of lists:

    dictionary = defaultdict(dict)
    for i in range(df.shape[0]):
    ...
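For reference, a self-contained sketch of the loop-based approach with `defaultdict(dict)` could look like the following. It runs on two made-up sample rows rather than the real CSV, and it swaps the question's regex parsing for `ast.literal_eval` (an assumption on my part, so that multi-word suggestions such as `'jio tv'` stay in one piece); the counting logic is otherwise the same idea.

```python
from ast import literal_eval
from collections import defaultdict

# Hypothetical sample rows: (suggestions, category), both stored as strings
# that look like Python lists, as in the question's dataframe.
rows = [
    ("['jio tv', 'jio']", "['MUSIC', 'TOOLS']"),
    ("['jio', 'vpn']", "['TOOLS', 'TOOLS']"),
]

# defaultdict(dict) creates an empty inner dict the first time a category is seen,
# so dictionary[category][suggestion] can be assigned by key.
dictionary = defaultdict(dict)

for raw_suggestions, raw_categories in rows:
    # set() deduplicates within a row, as in the question's loop.
    categories = set(literal_eval(raw_categories))
    suggestions = set(literal_eval(raw_suggestions))
    for category in categories:
        for suggestion in suggestions:
            # Start at 0 for an unseen suggestion, then increment.
            current = dictionary[category].get(suggestion, 0)
            dictionary[category][suggestion] = current + 1

print(dict(dictionary))
```

With `defaultdict(list)` the inner value is a list, and a list cannot be assigned to by a string key, which is why the inner container has to be a dict here.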