
Homophily in a social network using python


I am trying to compute the chance homophily, and then the observed homophily, of a network dataset in which each node has a color.

Example:

Node  Target   Colors 
A       N        1
N       A        0 
A       D        1
D       A        1
C       X        1
X       C        0
S       D        0
D       S        1
B                0
R       N        2
N       R        2

Each color belongs to the node in the Node column and is an integer from 0 to 2. I compute the chance homophily for a characteristic z (in my case Colors) as follows:

c_list=df[['Node','Colors']].set_index('Node').T.to_dict('list')
print("\nChance of same color:", round(chance_homophily(c_list),2))

where chance_homophily is defined as follows:

from collections import Counter

# The function below takes a dictionary with nodes as keys and each node's
# characteristic (here a one-element list holding the color) as values.
# It counts how often each characteristic occurs and returns the chance
# homophily: the sum of the squared relative frequencies.

def chance_homophily(dataset):
    # Lists are not hashable, so convert each value to a tuple before counting
    freq_dict = Counter(tuple(x) for x in dataset.values())
    counts = list(freq_dict.values())
    total = sum(counts)

    chance = 0
    for class_count in counts:
        chance += (class_count / total) ** 2
    return chance
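
As a quick check of the formula, the chance homophily is the sum over colors of the squared share of nodes with that color. The sketch below works this out for the sample table. It assumes one color per node; since N is listed with both 0 and 2, it further assumes the last listed value (2) wins, which is how a dict built row by row would behave. The name `node_colors` is illustrative only.

```python
from collections import Counter

# Hypothetical node -> color mapping read off the sample table.
# Assumption: N keeps its last listed color (2).
node_colors = {"A": 1, "N": 2, "D": 1, "C": 1, "X": 0, "S": 0, "B": 0, "R": 2}

counts = Counter(node_colors.values())  # number of nodes per color
total = sum(counts.values())            # number of nodes overall

# Probability that two nodes drawn independently share a color
chance = sum((n / total) ** 2 for n in counts.values())
print(round(chance, 2))
```

With three nodes each of colors 0 and 1 and two nodes of color 2, this is (3/8)^2 + (3/8)^2 + (2/8)^2 = 22/64, or about 0.34.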

Then the homophily is calculated as follows:

def homophily(G, chars, IDs):
    """
    Given a network G, a dict of characteristics chars for node IDs,
    and dict of node IDs for each node in the network,
    find the homophily of the network.
    """
    num_same_ties = 0
    num_ties = 0
    for n1, n2 in G.edges():
        if IDs[n1] in chars and IDs[n2] in chars:
            if G.has_edge(n1, n2):
                num_ties+=1
                if chars[IDs[n1]] == chars[IDs[n2]]:
                    num_same_ties+=1
    return (num_same_ties / num_ties) 

G should be built from the dataset above, using both the Node and Target columns. I am not very familiar with this network property, and I suspect I have missed something in the implementation (for example, does it correctly take the relationships among nodes into account?). In another example, with a different dataset, found on the web

https://campus.datacamp.com/courses/using-python-for-research/case-study-6-social-network-analysis?ex=1

the characteristic is also color (although there it is a string, whereas mine is numeric). I do not know whether that example takes the relationships among nodes into account, perhaps through an adjacency matrix. I have not implemented that part in my code, where I build the graph with

G = nx.from_pandas_edgelist(df, source='Node', target='Target')
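
To see which ties this call produces, here is a minimal sketch on the sample data. It assumes the blank Target for B is stored as an empty string (in a real file it may be NaN instead) and that an undirected graph is wanted, which is the default for `nx.from_pandas_edgelist`. Treating a blank Target as "no tie" is also an assumption.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({
    "Node":   ["A", "N", "A", "D", "C", "X", "S", "D", "B", "R", "N"],
    "Target": ["N", "A", "D", "A", "X", "C", "D", "S", "",  "N", "R"],
    "Colors": [1, 0, 1, 1, 1, 0, 0, 1, 0, 2, 2],
})

# Built directly, the blank Target becomes a node named "" tied to B
raw = nx.from_pandas_edgelist(df, source="Node", target="Target")

# Assumption: a blank Target means "no tie", so drop those rows first
edges = df[df["Target"] != ""]
G = nx.from_pandas_edgelist(edges, source="Node", target="Target")
G.add_nodes_from(df["Node"])  # keep B as an isolated node

print(sorted(G.edges()))
print("" in raw, "" in G)
```

Reciprocal rows such as A, N and N, A collapse into a single undirected tie, so each relationship is counted once when iterating over `G.edges()`.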

Solution

  • Your code works fine. The only thing you are missing is the IDs dict, which maps each node in the graph G to the key used for that node in chars. Because you create the graph from a pandas edge list, the nodes in G already carry the same names as in your data.

    This makes the IDs dict unnecessary. See the example below, once without the IDs dict and once with a trivial dict so that the original function can be used unchanged:

    import networkx as nx
    import pandas as pd
    from collections import Counter
    
    df = pd.DataFrame({"Node":["A","N","A","D","C","X","S","D","B","R","N"],
                      "Target":["N","A","D","A","X","C","D","S","","N","R"],
                      "Colors":[1,0,1,1,1,0,0,1,0,2,2]})
    
    c_list=df[['Node','Colors']].set_index('Node').T.to_dict('list')
    
    G = nx.from_pandas_edgelist(df, source='Node', target='Target')
    
    def homophily_without_ids(G, chars):
        """
        Given a network G and a dict of characteristics chars keyed by
        node name, find the homophily of the network.
        """
        num_same_ties = 0
        num_ties = 0
        for n1, n2 in G.edges():
            if n1 in chars and n2 in chars:
                if G.has_edge(n1, n2):
                    num_ties+=1
                    if chars[n1] == chars[n2]:
                        num_same_ties+=1
        return (num_same_ties / num_ties)
    
    print(homophily_without_ids(G, c_list))
    
    
    # Create the node ID map (trivial here: each node maps to itself)
    nodes_ids = {i:i for i in G.nodes()}
    
    def homophily(G, chars, IDs):
        """
        Given a network G, a dict of characteristics chars for node IDs,
        and dict of node IDs for each node in the network,
        find the homophily of the network.
        """
        num_same_ties = 0
        num_ties = 0
        for n1, n2 in G.edges():
            if IDs[n1] in chars and IDs[n2] in chars:
                if G.has_edge(n1, n2):
                    num_ties+=1
                    if chars[IDs[n1]] == chars[IDs[n2]]:
                        num_same_ties+=1
        return (num_same_ties / num_ties) 
    
    print(homophily(G, c_list, nodes_ids))
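
As a cross-check, the same measure can be sketched more compactly. This is not the original function but a hedged variant: `observed_homophily` is a hypothetical name, the graph is built by hand from the five distinct ties in the sample data, N is assumed to keep its last listed color (2), and returning NaN when there are no usable ties is a design choice made here to avoid a division by zero.

```python
import networkx as nx

# Hypothetical node -> color mapping for the sample data
# (assumption: N keeps its last listed color, 2)
colors = {"A": 1, "N": 2, "D": 1, "C": 1, "X": 0, "S": 0, "B": 0, "R": 2}

G = nx.Graph()
G.add_nodes_from(colors)
G.add_edges_from([("A", "N"), ("A", "D"), ("C", "X"), ("S", "D"), ("R", "N")])

def observed_homophily(G, chars):
    """Share of ties whose two endpoints have the same characteristic."""
    # G.edges() yields only existing ties, so no has_edge check is needed
    ties = [(u, v) for u, v in G.edges() if u in chars and v in chars]
    if not ties:
        return float("nan")  # no usable ties: avoid dividing by zero
    same = sum(chars[u] == chars[v] for u, v in ties)
    return same / len(ties)

result = observed_homophily(G, colors)
print(result)
```

Here two of the five ties (A-D and R-N) join nodes of the same color, so the result is 0.4. Comparing this value with the chance homophily of the same colors indicates whether same-color ties occur more often than random mixing would suggest.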