I'm new to the world of graphs and would appreciate some help :-)
I have a dataframe with 10 sentences and I calculated the cosine similarity between every pair of sentences.
Original Dataframe:
text
0 i like working with text
1 my favourite colour is blue and i like beans
2 i have a cat and a dog that are both chubby Pets
3 reading is also working with text just in anot...
4 cooking is great and i love making beans with ...
5 my cat likes cheese and my dog likes beans
6 in some way text is a bit boring
7 cooking is stressful when it is too complicated
8 pets can be so cute but they are often a lot o...
9 working with pets would be a dream job
Calculate cosine similarity:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
k = test_df['text'].tolist()
# Vectorise the data
vec = TfidfVectorizer()
X = vec.fit_transform(k)
# Calculate the pairwise cosine similarities
S = cosine_similarity(X)
# add output to new dataframe
print(len(S))
T = S.tolist()
df = pd.DataFrame.from_records(T)
Output for cosine similarities:
0 1 2 3 4 5 6 7 8 9
0 1.000000 0.204491 0.000000 0.378416 0.110185 0.000000 0.158842 0.000000 0.000000 0.282177
1 0.204491 1.000000 0.072468 0.055438 0.333815 0.327299 0.064935 0.112483 0.000000 0.000000
2 0.000000 0.072468 1.000000 0.000000 0.064540 0.231068 0.000000 0.000000 0.084140 0.000000
3 0.378416 0.055438 0.000000 1.000000 0.110590 0.000000 0.375107 0.097456 0.000000 0.156774
4 0.110185 0.333815 0.064540 0.110590 1.000000 0.205005 0.057830 0.202825 0.000000 0.071145
5 0.000000 0.327299 0.231068 0.000000 0.205005 1.000000 0.000000 0.000000 0.000000 0.000000
6 0.158842 0.064935 0.000000 0.375107 0.057830 0.000000 1.000000 0.114151 0.000000 0.000000
7 0.000000 0.112483 0.000000 0.097456 0.202825 0.000000 0.114151 1.000000 0.000000 0.000000
8 0.000000 0.000000 0.084140 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.185502
9 0.282177 0.000000 0.000000 0.156774 0.071145 0.000000 0.000000 0.000000 0.185502 1.000000
I now want to create a graph from both dataframes where the nodes are the sentences, connected by edges that carry the cosine similarity. I have added the nodes as you can see below, but I'm not sure how to add the edges.
import networkx as nx
### Build graph
G = nx.Graph()
# Add node
G.add_nodes_from(test_df['text'].tolist())
# Add edges
G.add_edges_from()
You could set the indices and column names in df as the text column in your input dataframe (nodes in the network), and build a graph from it as an adjacency matrix using nx.from_pandas_adjacency:
df_adj = pd.DataFrame(df.to_numpy(), index=test_df['text'], columns=test_df['text'])
G = nx.from_pandas_adjacency(df_adj)
G.edges(data=True)
EdgeDataView([('i like working with text ', 'i like working with text ', {'weight': 1.0}),
('i like working with text ', 'my favourite colour is blue and i like beans', {'weight': 0.19953178577876396}),
('i like working with text ', 'reading is also working with text just in anot...', {'weight': 0.39853956570404026})
...
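Note that the adjacency-matrix approach keeps the diagonal, so every sentence gets a self-loop with weight 1.0 (the first edge in the output above). If you would rather build the edge list yourself, here is a minimal sketch, assuming you only want one edge per pair of distinct sentences with a non-zero similarity. The names sentences and S below are small stand-ins for test_df['text'].tolist() and the full similarity matrix, using three sentences and the corresponding values from the question's table.

```python
from itertools import combinations

import networkx as nx
import numpy as np

# Stand-ins for test_df['text'].tolist() and the similarity matrix S
# (rows/columns 0, 1 and 5 of the matrix shown in the question).
sentences = [
    "i like working with text",
    "my favourite colour is blue and i like beans",
    "my cat likes cheese and my dog likes beans",
]
S = np.array([
    [1.000000, 0.204491, 0.000000],
    [0.204491, 1.000000, 0.327299],
    [0.000000, 0.327299, 1.000000],
])

G = nx.Graph()
G.add_nodes_from(sentences)

# combinations() yields each unordered pair (i, j) with i < j exactly once,
# so the diagonal (self-similarity) is never visited.
G.add_weighted_edges_from(
    (sentences[i], sentences[j], float(S[i, j]))
    for i, j in combinations(range(len(sentences)), 2)
    if S[i, j] > 0  # assumption: pairs with zero similarity get no edge
)

print(G.edges(data=True))
```

Replacing the `S[i, j] > 0` condition with a higher cutoff would keep only the stronger links; which cutoff makes sense depends on your data.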