python-3.x pandas numpy matplotlib chord-diagram

How to create the matrix for a chord diagram based on column value


Say I have a data frame with data in the following format:

UID | Name | ID
----------------
1 | ABC | IM-1
2 | XYZ | IM-2
3 | XYZ | IM-2
4 | PQR | IM-3
5 | PQR | IM-4
6 | PQR | IM-5
7 | XYZ | IM-5
8 | ABC | IM-5

I need to create a matrix to feed into the chord diagram code, which requires its input in the following format:

(array([[0,1,1,1],
        [1,1,1,0],
        [1,1,0,2]]),['ABC','XYZ','PQR'])

Note: In this example,

- the set of "Name" values is finite (i.e. ABC, XYZ or PQR)
- an "ID" can be shared between records
- the fourth column is the number of records that stand alone (for example, ABC has a single standalone record, IM-1, and PQR has two, IM-3 and IM-4)
- the other entries of the matrix are the linkages between Names based on ID (for example, IM-5 increases the value of PQR-XYZ, XYZ-PQR, PQR-ABC, ABC-PQR, XYZ-ABC and ABC-XYZ)
- the goal is to create a chord diagram for the connections between the "Name" field

I know this is quite a read. Thanks in advance for your help.


Solution

  • I've updated my answer, but the approach is basically the same. Parse the data into a data frame, then do an inner self-join on ID to get the pairs of names that are linked by a shared ID. Convert this edge list into an adjacency matrix. Finally, count the "dangling" edges, i.e. the IDs that occur only once (added in the updated answer), and group those counts by the corresponding Name to get the last column.

    #!/usr/bin/env python
    """
    Create adjacency matrix from a dataframe, where edges are implicitly defined by shared attributes.
    
    Answer to:
    https://stackoverflow.com/questions/57849602/how-to-create-the-matrix-for-chord-diagram-based-on-column-value
    """
    import numpy as np
    import pandas as pd
    from collections import Counter
    
    def parse_data_format(file_path):
        # read data skipping second line
        df = pd.read_csv(file_path, sep='|', skiprows=[1])
    
        # strip whitespace from column names
        df = df.rename(columns=lambda x: x.strip())
    
        # strip whitespace from values
        df_obj = df.select_dtypes(['object'])
        df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
    
        return df
    
    
    def get_edges(df):
        """Get all combinations of 'Name' that share a 'ID' value (using an inner join)."""
        inner_self_join = df.merge(df, how='inner', on='ID')
        excluding_self_pairs = inner_self_join[inner_self_join['UID_x']!=inner_self_join['UID_y']]
        edges = excluding_self_pairs[['Name_x', 'Name_y']].values
        return edges
    
    
    def get_adjacency(edges):
        "Convert a list of 2-tuples specifying source and target of a connection into an adjacency matrix."
        order = np.unique(edges)
        total_names = len(order)
        name_to_idx = {name: idx for idx, name in enumerate(order)}
        adjacency = np.zeros((total_names, total_names))
        for (source, target) in edges:
            adjacency[name_to_idx[source], name_to_idx[target]] += 1
        return adjacency, order
    
    
    def get_dangling_edge_counts(df):
        # get IDs with count 1
        counts = Counter(df['ID'].values)
        singles = [ID for (ID, count) in counts.items() if count == 1]
        # get corresponding names
        names = [df[df['ID']==ID]['Name'].values[0] for ID in singles]
        # convert into counts
        return Counter(names)
    
    
    if __name__ == '__main__':
    
        # here we read in the data as a file buffer;
        # however, normally we would hand a file path to parse_data_format instead
        import sys
        if sys.version_info[0] < 3:
            from StringIO import StringIO
        else:
            from io import StringIO
    
        data = StringIO(
            """UID | Name | ID
            ----------------
            1 | ABC | IM-1
            2 | XYZ | IM-2
            3 | XYZ | IM-2
            4 | PQR | IM-3
            5 | PQR | IM-4
            6 | PQR | IM-5
            7 | XYZ | IM-5
            8 | ABC | IM-5
            """
        )
    
        df = parse_data_format(data)
        edges = get_edges(df)
        adjacency, order = get_adjacency(edges)
        print(adjacency)
        # [[0. 1. 1.]
        #  [1. 0. 1.]
        #  [1. 1. 2.]]
        # (XYZ-XYZ is 2: records 2 and 3 share IM-2, and the pair is counted in both directions)
        print(order)
        # ['ABC' 'PQR' 'XYZ']
    
        dangling_edge_counts = get_dangling_edge_counts(df)
        print(dangling_edge_counts)
        # Counter({'PQR': 2, 'ABC': 1})
    
        # extra column of standalone-record counts, aligned with `order`
        last_column = np.zeros_like(order, dtype=int)
        for ii, name in enumerate(order):
            if name in dangling_edge_counts:
                last_column[ii] = dangling_edge_counts[name]
        combined = np.concatenate([adjacency, last_column[:, np.newaxis]], axis=-1)
        print(combined)
        #[[0. 1. 1. 1.]
        # [1. 0. 1. 2.]
        # [1. 1. 2. 0.]]
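The script above returns the names in sorted order and the matrix as floats, whereas the question asks for an `(array, names)` pair with the names in a chosen order. The following is a minimal sketch of one way to package the same three steps (self-join, pair counts, standalone counts) into that shape. It swaps the explicit loop for `pd.crosstab`, and the function name `build_chord_input` and its `name_order` parameter are made up for this illustration.

```python
import numpy as np
import pandas as pd


def build_chord_input(df, name_order=None):
    """Return (matrix, names): pairwise link counts per Name plus a final
    column holding the number of standalone records per Name.

    `build_chord_input` and `name_order` are hypothetical names for this sketch.
    """
    if name_order is not None:
        names = list(name_order)
    else:
        names = sorted(df['Name'].unique())

    # pair up records that share an ID, dropping each record's pairing with itself
    pairs = df.merge(df, how='inner', on='ID')
    pairs = pairs[pairs['UID_x'] != pairs['UID_y']]

    # count the (Name_x, Name_y) pairs and lay them out in the requested order
    links = pd.crosstab(pairs['Name_x'], pairs['Name_y'])
    links = links.reindex(index=names, columns=names, fill_value=0)

    # records whose ID occurs exactly once, counted per Name
    is_single = ~df['ID'].duplicated(keep=False)
    standalone = df.loc[is_single, 'Name'].value_counts()
    standalone = standalone.reindex(names, fill_value=0)

    matrix = np.column_stack([links.to_numpy(), standalone.to_numpy()])
    return matrix.astype(int), names


# the example data from the question
df = pd.DataFrame({
    'UID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Name': ['ABC', 'XYZ', 'XYZ', 'PQR', 'PQR', 'PQR', 'XYZ', 'ABC'],
    'ID': ['IM-1', 'IM-2', 'IM-2', 'IM-3', 'IM-4', 'IM-5', 'IM-5', 'IM-5'],
})

matrix, names = build_chord_input(df, name_order=['ABC', 'XYZ', 'PQR'])
print(matrix)
# [[0 1 1 1]
#  [1 2 1 0]
#  [1 1 0 2]]
print(names)
# ['ABC', 'XYZ', 'PQR']
```

As in the script above, a link between two records with the same Name (XYZ via IM-2) is counted once per direction, so the diagonal entry is 2 where the question's expected output shows 1. If the single count is what you want, halving the diagonal of the square part would be one way to get it.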