I am trying to determine the chance of homophily, then the homophily, of a dataset having nodes as keys and colors as values.
Example:
Node Target Colors
A N 1
N A 0
A D 1
D A 1
C X 1
X C 0
S D 0
D S 1
B 0
R N 2
N R 2
Colors are associated with the Node column and span from 0 to 2 (int). The steps for calculating the chance of homophily on a characteristic z (in my case Color) are illustrated as follows:
c_list=df[['Node','Colors']].set_index('Node').T.to_dict('list')
print("\nChance of same color:", round(chance_homophily(c_list),2))
where chance_homophily
is defined as follows:
# The function below takes a dictionary with characteristics as keys and the frequency of their occurrence as values.
# Then it computes the chance homophily for that characteristic (color)
def chance_homophily(dataset):
freq_dict = Counter([tuple(x) for x in dataset.values()])
df_freq_counter = freq_dict
c_list = list(df_freq_counter.values())
chance_homophily = 0
for class_count in c_list:
chance_homophily += (class_count/sum(c_list))**2
return chance_homophily
Then the homophily is calculated as follows:
def homophily(G, chars, IDs):
"""
Given a network G, a dict of characteristics chars for node IDs,
and dict of node IDs for each node in the network,
find the homophily of the network.
"""
num_same_ties = 0
num_ties = 0
for n1, n2 in G.edges():
if IDs[n1] in chars and IDs[n2] in chars:
if G.has_edge(n1, n2):
num_ties+=1
if chars[IDs[n1]] == chars[IDs[n2]]:
num_same_ties+=1
return (num_same_ties / num_ties)
G should be built from my dataset above (so taking into account both node and target columns). I am not totally familiar with this network property but I think I have missed something in the implementation (e.g., is it correctly taking count of relationships among nodes in the network?). In another example (with different dataset) found on the web
the characteristic is also color (though it is a string, while I have a numeric variable). I do not know if they take into consideration relationship among nodes to determine, maybe using adjacency matrix: this part has not been implemented in my code, where I am using
G = nx.from_pandas_edgelist(df, source='Node', target='Target')
Your code works perfectly fine. The only thing you are missing is the IDs dict, which would map the names of your nodes to the names of the nodes in the graph G. By creating the graph from a pandas edgelist, you are already naming your nodes, as they are in the data.
This renders the use of the "IDs"dict unnecessary. Check out the example below, one time wihtou the IDs dict and one time with a trivial dict to use the original function:
import networkx as nx
import pandas as pd
from collections import Counter
df = pd.DataFrame({"Node":["A","N","A","D","C","X","S","D","B","R","N"],
"Target":["N","A","D","A","X","C","D","S","","N","R"],
"Colors":[1,0,1,1,1,0,0,1,0,2,2]})
c_list=df[['Node','Colors']].set_index('Node').T.to_dict('list')
G = nx.from_pandas_edgelist(df, source='Node', target='Target')
def homophily_without_ids(G, chars):
"""
Given a network G, a dict of characteristics chars for node IDs,
and dict of node IDs for each node in the network,
find the homophily of the network.
"""
num_same_ties = 0
num_ties = 0
for n1, n2 in G.edges():
if n1 in chars and n2 in chars:
if G.has_edge(n1, n2):
num_ties+=1
if chars[n1] == chars[n2]:
num_same_ties+=1
return (num_same_ties / num_ties)
print(homophily_without_ids(G, c_list))
#create node ids map - trivial in this case
nodes_ids = {i:i for i in G.nodes()}
def homophily(G, chars, IDs):
"""
Given a network G, a dict of characteristics chars for node IDs,
and dict of node IDs for each node in the network,
find the homophily of the network.
"""
num_same_ties = 0
num_ties = 0
for n1, n2 in G.edges():
if IDs[n1] in chars and IDs[n2] in chars:
if G.has_edge(n1, n2):
num_ties+=1
if chars[IDs[n1]] == chars[IDs[n2]]:
num_same_ties+=1
return (num_same_ties / num_ties)
print(homophily(G, c_list, nodes_ids))