
How can I analyze list values in a Python dictionary and add extra dictionary values showing the related keys and the number of list values they have in common?


I have a large dictionary of the form shown below. I am trying to find similarities between keys based on their values, which are lists.

data_dict = {623823: ['draintheswamp', 'swimming'], 856273: ['elect2015'], 8236472: [], 623526: ['yearmatters'], 72645: ['elect2015'], 723641: ['draintheswamp'], 712641: ['swimming'], 917265: ['elect2015', 'draintheswamp']}

I want to add two extra pieces of information to each dictionary value: the keys that each key is related to (or null if no similarity is found) and the number of values it shares with each of those keys.
Columns in the dictionary values would be (key, [text_used], [related_key, number_of_related_texts])

A brief example of what the resulting dictionary should look like:

new_dict = {623823: (['draintheswamp', 'swimming'], [(723641, 1), (712641, 1), (917265, 1)]), 856273: (['elect2015'], [(72645, 1), (917265, 1)]), ...}

Solution

  • Here is a quick way to generate the dictionary you requested. For brevity, it uses NumPy's np.intersect1d function to count the items shared between two dict-value lists.

    import numpy as np
    
    new_data = {}  # result: key -> (own texts, [(related_key, shared_count), ...])
    for key, texts in data_dict.items():
        related = []  # one (other_key, shared_count) pair per related key
        for k, v in data_dict.items():
            if key == k:  # don't count similarity on the same key
                continue
            shared = np.intersect1d(texts, v)  # unique items present in both lists
            # Test .size rather than the array itself: the truth value of an
            # array with more than one element is ambiguous and raises ValueError.
            if shared.size:
                related.append((k, len(shared)))  # add k and its number of shared items
        # Keep the key's own texts, as in the requested output;
        # related stays [] when no other key shares an item.
        new_data[key] = (texts, related)
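The same pairwise comparison can also be sketched without NumPy, using built-in set intersection to count shared items. This is only a minimal sketch, not the answer's original code: the function name `related_keys` is hypothetical, and it assumes that a text repeated inside one list should count once.

```python
def related_keys(data):
    """Map each key to (own_texts, [(other_key, shared_count), ...])."""
    result = {}
    for key, texts in data.items():
        own = set(texts)
        related = []
        for other, other_texts in data.items():
            if other == key:
                continue  # skip comparing a key with itself
            count = len(own & set(other_texts))  # distinct texts in both lists
            if count:
                related.append((other, count))
        result[key] = (texts, related)  # related is [] when nothing is shared
    return result


# Sample data taken from the question.
data_dict = {623823: ['draintheswamp', 'swimming'], 856273: ['elect2015'],
             8236472: [], 623526: ['yearmatters'], 72645: ['elect2015'],
             723641: ['draintheswamp'], 712641: ['swimming'],
             917265: ['elect2015', 'draintheswamp']}

new_dict = related_keys(data_dict)
print(new_dict[856273])
```

Keys with no related entries keep an empty list as their second element, so every value has the same shape.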
    

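    Note that this is a quick-and-dirty routine meant to mimic what you were asking for, not an optimized one: it compares every key against every other key, so the work grows quadratically with the number of keys. If you have any questions that the comments don't answer, please let me know. I hope this helps your project. Good luck!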
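Since the routine above is described as not optimized, one possible alternative for a larger dictionary is to build an inverted index from each text to the keys that use it, so that only keys which actually share a text are ever compared. The sketch below is an assumption about how that could look, not part of the original answer: the name `related_keys_indexed` is hypothetical, and related keys are returned sorted by key rather than in dictionary order.

```python
from collections import defaultdict


def related_keys_indexed(data):
    """Like a pairwise comparison, but driven by a text -> keys index."""
    # Inverted index: each text maps to the keys whose list contains it.
    index = defaultdict(list)
    for key, texts in data.items():
        for text in set(texts):  # set() so a repeated text counts once
            index[text].append(key)

    result = {}
    for key, texts in data.items():
        counts = defaultdict(int)  # other_key -> number of shared texts
        for text in set(texts):
            for other in index[text]:
                if other != key:
                    counts[other] += 1
        # Sort by key so the output order is deterministic.
        result[key] = (texts, sorted(counts.items()))
    return result


# Sample data taken from the question.
sample = {623823: ['draintheswamp', 'swimming'], 856273: ['elect2015'],
          8236472: [], 623526: ['yearmatters'], 72645: ['elect2015'],
          723641: ['draintheswamp'], 712641: ['swimming'],
          917265: ['elect2015', 'draintheswamp']}

indexed = related_keys_indexed(sample)
print(indexed[623823])
```

Whether this is faster in practice depends on the data: it helps when most pairs of keys share nothing, and helps little when a few texts appear under almost every key.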