Search code examples
pythonpandasgraphdigraphs

Pandas DataFrame with levels of graph nodes and edges to square matrix


My Googlefu has failed me!

I have a Pandas DataFrame of the form:

Level 1   Level 2   Level 3   Level 4
-------------------------------------
A         B         C         NaN
A         B         D         E
A         B         D         F
G         H         NaN       NaN
G         I         J         K

It basically contains nodes of a graph with the levels depicting an outgoing edge from a level of lower order to a level of a higher order. I want to convert the DataFrame/create a new DataFrame of the form:

     A   B   C   D   E   F   G   H   I   J   K
  ---------------------------------------------
A |  0   1   0   0   0   0   0   0   0   0   0
B |  0   0   1   1   0   0   0   0   0   0   0
C |  0   0   0   0   0   0   0   0   0   0   0
D |  0   0   0   0   1   1   0   0   0   0   0
E |  0   0   0   0   0   0   0   0   0   0   0
F |  0   0   0   0   0   0   0   0   0   0   0
G |  0   0   0   0   0   0   0   1   1   0   0
H |  0   0   0   0   0   0   0   0   0   0   0
I |  0   0   0   0   0   0   0   0   0   1   0
J |  0   0   0   0   0   0   0   0   0   0   1
K |  0   0   0   0   0   0   0   0   0   0   0

A cell containing 1 depicts an outgoing edge from the corresponding row to the corresponding column. Is there a Pythonic way to achieve this without loops and conditions in Pandas?


Solution

  • Try this code:

    df = pd.DataFrame({'level_1':['A', 'A', 'A', 'G', 'G'], 'level_2':['B', 'B', 'B', 'H', 'I'],
        'level_3':['C', 'D', 'D', np.nan, 'J'], 'level_4':[np.nan, 'E', 'F', np.nan, 'K']})
    

    Your input dataframe is:

      level_1 level_2 level_3 level_4
    0       A       B       C     NaN
    1       A       B       D       E
    2       A       B       D       F
    3       G       H     NaN     NaN
    4       G       I       J       K
    

    And the solution is:

    # Get unique values from input dataframe and filter out 'nan' values
    list_nodes = []
    for i_col in df.columns.tolist():
        list_nodes.extend(filter(lambda v: v==v, df[i_col].unique().tolist()))
    
    # Initialize your result dataframe
    df_res = pd.DataFrame(columns=sorted(list_nodes), index=sorted(list_nodes))
    df_res = df_res.fillna(0)
    
    # Get 'index-column' pairs from input dataframe ('nan's are exluded)
    list_indexes = []
    for i_col in range(df.shape[1]-1):
        list_indexes.extend(list(set([tuple(i) for i in df.iloc[:, i_col:i_col+2]\
            .dropna(axis=0).values.tolist()])))
    
    # Use 'index-column' pairs to fill the result dataframe
    for i_list_indexes in list_indexes:
        df_res.set_value(i_list_indexes[0], i_list_indexes[1], 1)
    

    And the final result is:

       A  B  C  D  E  F  G  H  I  J  K
    A  0  1  0  0  0  0  0  0  0  0  0
    B  0  0  1  1  0  0  0  0  0  0  0
    C  0  0  0  0  0  0  0  0  0  0  0
    D  0  0  0  0  1  1  0  0  0  0  0
    E  0  0  0  0  0  0  0  0  0  0  0
    F  0  0  0  0  0  0  0  0  0  0  0
    G  0  0  0  0  0  0  0  1  1  0  0
    H  0  0  0  0  0  0  0  0  0  0  0
    I  0  0  0  0  0  0  0  0  0  1  0
    J  0  0  0  0  0  0  0  0  0  0  1
    K  0  0  0  0  0  0  0  0  0  0  0