I have two trigram lists (20 word combinations each), e.g.
l1 = ('hello', 'its', 'me'), ('I', 'need', 'help') ...
l2 = ('I', 'need', 'help'), ('What', 'is', 'this') ...
Now I want to visualize these two lists in one diagram (maybe a pairplot) to see whether there are similarities (all 3 words must be the same).
Thank you in advance
The answer from Larry the Llama seems to have missed the "see if there are similarities" part, as its solution uses set(), which removes any duplicates.
If you want a full iteration that finds trigrams matching on all three words:
merged = l1 + l2
results_counter = {}
# Iterate over all the trigrams
for index, trigram in enumerate(merged):
    # Skip trigrams that were already counted at an earlier position
    if str(trigram) in results_counter:
        continue
    # Iterate over the trigrams at or after this position in the list
    for second_index in range(index, len(merged)):
        all_same = True
        # Compare word by word against the trigram being checked
        for word_index, word in enumerate(trigram):
            if merged[second_index][word_index] != word:
                all_same = False
                break
        if all_same:
            # Add the key with a count of 0 if it is missing, then add one
            results_counter.setdefault(str(trigram), 0)
            results_counter[str(trigram)] += 1
# Print the count for each trigram
for key in results_counter:
    print(key, results_counter[key])
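The same count can probably be written more compactly with collections.Counter from the standard library. This is only a sketch, assuming the trigrams are tuples of strings (so they are hashable); the sample data below is a hypothetical stand-in for the lists in the question.

```python
from collections import Counter

# Hypothetical sample data in the shape shown in the question
l1 = [("hello", "its", "me"), ("I", "need", "help")]
l2 = [("I", "need", "help"), ("What", "is", "this")]

# Count how often each full trigram occurs across both lists
counts = Counter(l1 + l2)

# Trigrams that occur in both lists (all three words identical)
shared = [t for t in counts if t in l1 and t in l2]

for trigram, count in counts.items():
    print(trigram, count)
print("shared:", shared)
```

Unlike a plain set(), the Counter keeps the number of occurrences, so duplicates are not lost.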
Edit after clarification:
import seaborn as sns
import pandas as pd
d1 = [("hello", "its", "me"), ("dont", "its", "me")]
d2 = [("hello", "its", "me"), ("Hello", "I", "dont")]
word_to_number = {}
number_to_word = {} # if you want to show the sentence again
def one_hot(l):
    """
    Encode each distinct word as an integer (strictly speaking this is
    integer/label encoding rather than one-hot encoding) and return the
    encoded list, while also filling the converter dictionaries so the
    numbers can be converted back to words.
    """
    one_hot_encoded = []
    for trigram in l:
        encoded_trigram = []
        for word in trigram:
            # Look up the word's number, assigning the next free one if it is new
            encoded_word = word_to_number.setdefault(word, len(word_to_number))
            number_to_word[encoded_word] = word
            # Add the number to the encoded trigram
            encoded_trigram.append(encoded_word)
        # Add the encoded trigram to the result list
        one_hot_encoded.append(encoded_trigram)
    return one_hot_encoded
d1 = one_hot(d1)
d2 = one_hot(d2)
data = {}
for ind, trigram in enumerate(d1 + d2):
    # Each encoded trigram becomes one column to be compared
    data["t" + str(ind)] = trigram
frame = pd.DataFrame.from_dict(data)
print(frame)
plot = sns.pairplot(frame)
# Make it clear
plot.set(ylim=(frame.min().min() - 1, frame.max().max() + 1))
plot.set(xlim=(frame.min().min() - 1, frame.max().max() + 1))
import matplotlib.pyplot as plt
plt.show()
This will give you a pairplot of your trigrams, although it will not be very intuitive: two trigrams are identical only if all of their points lie exactly on the diagonal of the corresponding panel. You may use this, but make sure you don't have too many different words, as that will stretch the axes and make the results very difficult to read.
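If the pairplot turns out to be too hard to read, one possible alternative is to compare the two lists directly and mark exact matches in a small 0/1 grid. The following is a minimal sketch using only plain Python; the sample data and the name `matrix` are assumptions for illustration, not part of the code above.

```python
# Hypothetical sample data; replace with the real trigram lists
l1 = [("hello", "its", "me"), ("I", "need", "help")]
l2 = [("I", "need", "help"), ("What", "is", "this")]

# matrix[i][j] is 1 when l1[i] and l2[j] share all three words, else 0
matrix = [[int(a == b) for b in l2] for a in l1]

# Print a plain-text grid: rows are l1, columns are l2
for trigram, row in zip(l1, matrix):
    print(" ".join(str(cell) for cell in row), " ".join(trigram))
```

The nested list could then be passed to a heatmap function (for example seaborn's heatmap) if a graphical view is preferred.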