python matplotlib seaborn visualization n-gram

How do I visualize two columns/lists of trigrams to see if the same word combinations occur in both columns/lists?


I have two lists of trigrams (20 word combinations each), e.g.

l1 = ('hello', 'its', 'me'), ('I', 'need', 'help') ...

l2 = ('I', 'need', 'help'), ('What', 'is', 'this') ...

Now I want to visualize these two lists in one diagram (maybe a pairplot) to see if there are similarities (all 3 words must be the same).

Thank you in advance


Solution

  • The answer given by Larry the Llama seems to have missed the "see if there are similarities" part, as that solution uses set(), which removes any duplicates.

    If you want a full iteration to find fully identical trigrams:

    merged = l1 + l2
    
    results_counter = {}
    
    # Iterate over all the trigrams
    for index, trigram in enumerate(merged):
        key = str(trigram)
        # Skip trigrams that were already counted at their first occurrence
        if key in results_counter:
            continue
    
        # Compare with this trigram and every trigram that comes after it in the list
        for second_index in range(index, len(merged)):
            all_same = True
    
            # The trigrams match only if every word matches
            for word_index, word in enumerate(trigram):
                if merged[second_index][word_index] != word:
                    all_same = False
                    break
    
            if all_same:
                # Add the key with a count of 0 if it is not in results_counter yet
                results_counter.setdefault(key, 0)
                # Add one
                results_counter[key] += 1
    
    # Print each trigram with its number of occurrences; a count above 1 marks a repeat
    for key in results_counter.keys():
        print(key, results_counter[key])
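
    As a shorter alternative to the nested loops, the same check can be sketched with collections.Counter from the standard library. This is only a minimal sketch: the sample lists and the function name shared_trigrams are assumptions for illustration, and it relies on the trigrams being tuples, which compare equal only when all three words match.

```python
from collections import Counter

# Hypothetical sample data in the shape used in the question
l1 = [("hello", "its", "me"), ("I", "need", "help")]
l2 = [("I", "need", "help"), ("What", "is", "this")]


def shared_trigrams(first, second):
    """Return the trigrams found in both lists with their count in each list."""
    first_counts = Counter(first)
    second_counts = Counter(second)
    # A tuple key matches only if all three words are identical
    return {
        trigram: (first_counts[trigram], second_counts[trigram])
        for trigram in first_counts
        if trigram in second_counts
    }


shared = shared_trigrams(l1, l2)
for trigram, (count_1, count_2) in shared.items():
    print(trigram, count_1, count_2)
```

    Unlike a plain set intersection, this keeps the per-list counts, so repeats inside one list are not lost.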
    

    Edit after clarification:

    import seaborn as sns
    import pandas as pd
    
    d1 = [("hello", "its", "me"), ("dont", "its", "me")]
    d2 = [("hello", "its", "me"), ("Hello", "I", "dont")]
    
    word_to_number = {} 
    number_to_word = {} # if you want to show the sentence again
    def one_hot(l):
        """
        Encode each word as an integer (strictly a label encoding rather
        than a true one-hot encoding) and return the encoded list, while
        also filling the converter dictionaries for reverse conversion.
        """
        one_hot_encoded = []
        for trigram in l:
            encoded_trigram = []
            for word in trigram:
                # Add encoding of the word
                encoded_word = word_to_number.setdefault(word, len(word_to_number))
                number_to_word[encoded_word] = word
                # Add the encoded word to the encoded trigram
                encoded_trigram.append(encoded_word)
            
            # Add the encoded trigram to the list that is returned
            one_hot_encoded.append(encoded_trigram)
    
        return one_hot_encoded
    
    d1 = one_hot(d1)
    d2 = one_hot(d2)
    
    data = {}
    for ind, trigram in enumerate(d1 + d2):
        # One column per trigram, so the trigrams can be compared pairwise
        data["t" + str(ind)] = trigram
    
    frame = pd.DataFrame.from_dict(data)
    print(frame)
    
    plot = sns.pairplot(frame)
    # Use the same padded limits on every axis so the panels are comparable
    plot.set(ylim=(frame.min().min() - 1, frame.max().max() + 1))
    plot.set(xlim=(frame.min().min() - 1, frame.max().max() + 1))
    
    import matplotlib.pyplot as plt
    plt.show()
    

    This piece will give you a pairplot of your trigrams, although it will not be very intuitive, as you must look for points that line up exactly: two identical trigrams show up as points lying on the diagonal of the panel that compares them. You may use this, but make sure you don't have too many different words, as that will skew the axes and make the results visually very difficult to read.
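
    If the pairplot turns out to be hard to read, one possible alternative (a different technique, not the pairplot above) is a presence grid: one row per distinct trigram and one column per list. The sketch below is an assumption-laden illustration; the helper names presence_matrix and plot_presence are hypothetical, and matplotlib is imported inside the plotting function so the matrix can be built without it.

```python
def presence_matrix(first, second):
    """Return (labels, rows): one row per distinct trigram, holding 1/0 flags
    for whether it occurs in the first list and in the second list."""
    # dict.fromkeys keeps first-seen order while dropping repeats
    distinct = list(dict.fromkeys(list(first) + list(second)))
    first_set, second_set = set(first), set(second)
    labels = [" ".join(trigram) for trigram in distinct]
    rows = [[int(t in first_set), int(t in second_set)] for t in distinct]
    return labels, rows


def plot_presence(labels, rows):
    """Draw the matrix as a grid; a row with two filled cells is a shared trigram."""
    import matplotlib.pyplot as plt  # local import: only needed for drawing

    fig, ax = plt.subplots()
    ax.imshow(rows, cmap="Greys", aspect="auto", vmin=0, vmax=1)
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["list 1", "list 2"])
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    plt.tight_layout()
    plt.show()


d1 = [("hello", "its", "me"), ("dont", "its", "me")]
d2 = [("hello", "its", "me"), ("Hello", "I", "dont")]
labels, rows = presence_matrix(d1, d2)
```

    Calling plot_presence(labels, rows) then draws the grid. With 20 trigrams per list this stays readable, because every trigram gets its own labelled row instead of being spread over encoded axes.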