Search code examples
pythondictionarygraphcombinationsadjacency-matrix

Searching for every combination of inner dict keys in every combination of other inner dict keys, also in every combination of outer dict key


I'm not sure if the title describes my question well, but if there is something wrong I'll edit later. I've checked many questions related to this but since, the code is so nested, I'm not very experienced in programming and I need to use combinations I couldn't handle.

I have a nested dict, which is similar to this:

example_dictionary = {'I want to eat peach and egg.':{'apple':3, 'orange':2, 'banana':5},\
                   'Peach juice is so delicious.':{'apple':3, 'orange':5, 'banana':2}, \
'Goddamn monkey ate my banana.':{'rice':4, 'apple':6, 'monkey':2}, \
'They say apple is good for health.':{'grape':10, 'monkey':5, 'peach':5, 'egg':8}}

What I'm trying to do is building an adjacency matrix by following some rules. The rules are:

1) If a word in the any of the inner dict exist in any of the sentence(outer dict keys), then add a weight as the value of the word between related sentences.

2) If any of two sentences has the same inner dict key(word) but different value then multiply the values of the words and add as weight between related sentences.

Extra note: inner dicts can have different lengths, same inner dict keys(words) might have different values. I want them to be multiplied only in this case, if they have the same values I don't want to take into account.

Example:

Sentence1(0): I want to eat peach and egg. {'apple':3, 'orange':2, 'banana':5}

Sentence2(1): Peach juice is so delicious. {'apple':3, 'orange':5, 'banana':2}

Sentence3(2): Goddamn monkey ate my banana.{'rice':4, 'apple':6, 'monkey':2}

Sentence4(3): They say apple is good for health. {'grape':10, 'monkey':5, 'peach':5, 'egg':8}

Between 0 and 1: 5*2+5*2=20 (because, their apple's has the same value, just multiplied the values for orange and banana. And none of the words exists in any sentence.)

Between 2 and 3: (2*5=10 (monkey is the same key with different value) +

6 (the key of sentence3 'apple' exists in sentence4) +

5 (the key of sentence4 'monkey' exists in sentence3)= 21

Between 0 and 3: 3+5+8=16 (sentence1 key 'apple' exists in sentence4, and sentence4 keys 'egg' and 'peach' exist in sentence1.

I hope these examples makes it clear.

What I have tried( this was pretty much confusing for me due to nested structure and combinations):

from itertools import combinations, zip_longest
import networkx as nx

def compare_inner_dicts(d1,d2):
#this is for comparing the inner dict keys and multiplying them
#if they have the same key but different value
    values = []
    inner_values = 0
    for common_key in d1.keys() & d2.keys():
        if d1[common_key]!= d2[common_key]:
            _value = d1[common_key]*d2[common_key]
            values.append(_value)
            inner_values = sum([p for p in values])

    inner_dict_values = inner_values
    del inner_values  

    return inner_dict_values


def build_adj_mat(a_dict):
    gr = nx.Graph()
    for sentence, words in a_dict.items():

        sentences = list(a_dict.keys())
        gr.add_nodes_from(sentences)
        sentence_pairs = combinations(gr.nodes, 2)
        dict_pairs = combinations(a_dict.values(), 2)
        for pair, _pair in zip_longest(sentence_pairs, dict_pairs):
            numbers = []
            x_numbers = []
            #y_numbers = []
            sentence1 = pair[0]
            sentence2 = pair[1]
            dict1 = _pair[0]
            dict2 = _pair[1]

            inner_dict_numbers = compare_inner_dicts(dict1, dict2)
            numbers.append(inner_dict_numbers)

            for word, num in words.items():
                if sentence2.find(word)>-1:
                    x = words[word]
                    x_numbers.append(x)
                    numbers.extend(x_numbers)
#                if sentence1.find(word)>-1: #reverse case
#                    y = words[word]
#                    y_numbers.append(y)
#                    numbers.extend(y_numbers)

                    total = sum([p for p in numbers if len(numbers)>0])

                    if total>0:
                        gr.add_edge(sentence1, sentence2, weight=total)
                        del total
                    else: del total
                else: 
                    continue
                    numbers.clear()
                    x_numbers.clear()
                   #y_numbers.clear()

    return gr

G = build_adj_mat(example_dictionary)
print(nx.adjacency_matrix(G))

Expected result:

(0, 1) 5*2+5*2=20
(0, 2) 3*6=18+5=23
(0, 3) 3+5+8=16
(1, 0) 20
(1, 2) 3*6=18+2=20
(1, 3) 3+5=8
(2, 0) 23
(2, 1) 20
(2, 3) 2*5=10+5+6=21
(3, 0) 16
(3, 1) 8
(3, 2) 21

Output:

  (0, 2)        23
  (0, 3)        6
  (1, 2)        23
  (1, 3)        6
  (2, 0)        23
  (2, 1)        23
  (2, 3)        16
  (3, 0)        6
  (3, 1)        6
  (3, 2)        16

By comparing expected output and compared output I can understand one of the problem, which is that my code just checks if the word in sentence1 exist in sentence2, but doesn't do the reverse. I tried to solve it by using commented out part, but it returned more nonsense results. Also I'm not sure if there are any other problem. I don't know how to get the correct result, these two combinations and nested structure making me totally lost. Sorry for the long question, to make it clear I described everything. Any help would be greatly appreciated, thanks in advance.


Solution

  • You can use the following function:

    from collections import defaultdict
    import itertools as it
    import re
    
    
    def compute_scores(sentence_dict):
        scores = defaultdict(int)
        for (j, (s1, d1)), (k, (s2, d2)) in it.combinations(enumerate(sentence_dict.items()), 2):
            shared_keys = d1.keys() & d2.keys()
            scores[j, k] += sum(d1[k]*d2[k] for k in shared_keys if d1[k] != d2[k])
            scores[j, k] += sum(d1[k] for k in d1.keys() & get_words(s2))
            scores[j, k] += sum(d2[k] for k in d2.keys() & get_words(s1))
        return scores
    
    
    def get_words(sentence):
        return set(map(str.lower, re.findall(r'(?<=\b)\w+(?=\b)', sentence)))
    

    The result depends of course on what you define as a word, so you'd need to fill in your own definition in the function get_words. The default implementation seems to fit your example data. Since the score for a sentence pair is symmetric according to your definition, there's no need to consider the reverse pairing as well (it has the same score); i.e. (0, 1) has the same score as (1, 0). That's why the code uses itertools.combinations.

    Running the example data:

    from pprint import pprint
    
    example_dictionary = {
        'I want to eat peach and egg.': {'apple':3, 'orange':2, 'banana':5},
        'Peach juice is so delicious.': {'apple':3, 'orange':5, 'banana':2},
        'Goddamn monkey ate my banana.': {'rice':4, 'apple':6, 'monkey':2},
        'They say apple is good for health.': {'grape':10, 'monkey':5, 'peach':5, 'egg':8}}
    
    pprint(compute_scores(example_dictionary))
    

    gives the following scores:

    defaultdict(<class 'int'>,
                {(0, 1): 20,
                 (0, 2): 23,
                 (0, 3): 16,
                 (1, 2): 20,
                 (1, 3): 8,
                 (2, 3): 21})
    

    In case the dicts can not only contain words, but also phrases (i.e. multiple words) a slight modification of the original implementation will do (also works for single words):

    scores[j, k] += sum(weight for phrase, weight in d1.items() if phrase in s2.lower())
    scores[j, k] += sum(weight for phrase, weight in d2.items() if phrase in s1.lower())