python pandas categories adjacency-matrix

How to create an adjacency matrix in pandas such that the labels are preserved when rows and cols are rearranged


I have never used pandas or numpy for this purpose before and am wondering what's the idiomatic way to construct labeled adjacency matrices in pandas.

My data comes in a shape similar to the one below. Each "uL22"-style key is a protein, and each array holds the neighbors of that protein. Hence (in the example below) an adjacency matrix would have 1s in the bL31 row, uL5 column, and the converse, etc.

My problem is twofold:

  1. The actual dimension of the adjacency matrix is dictated by a set of protein names that is generally much larger than the set contained in the nbrtree, so I'm wondering what's the best way to map my nbrtree data onto that set, say a 100 by 100 matrix corresponding to the neighborhood relationships of 100 proteins.

  2. I'm not quite sure how to "bind" the names (i.e. uL32 etc.) of those 100 proteins to the rows and columns of this matrix such that when I start moving rows around, the names move accordingly. (I'm planning to rearrange the adjacency matrix so that it has a block-diagonal structure.)

"nbrtree": {
        "bL31": ["uL5"],
        "uL5": ["bL31"],
        "bL32": ["uL22"],
        "uL22": ["bL32","bL17"],
         ...
        "bL33": ["bL35"],
        "bL35": ["bL33","uL15"],
        "uL13": ["bL20"],
        "bL20": ["uL13","bL21"]
}
>>> len(nbrtree)
40

I'm sure this is a manipulation that people perform daily. I'm just not quite familiar with how dataframes work, so I'm probably looking for something very obvious. Thank you so much!


Solution

  • I don't fully understand your question, but based on what I do understand, try this code.

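    import pandas as pd

    # Toy neighbor lists; the keys double as row and column labels.
    dic = {"first": {
        "a": ["b", "d"],
        "b": ["a", "h"],
        "c": ["d"],
        "d": ["c", "g"],
        "e": ["f"],
        "f": ["e", "d"],
        "g": ["h", "a"],
        "h": ["g", "b"]
    }}
    col = list(dic["first"].keys())
    # Square frame of zeros, labeled identically on both axes.
    data = pd.DataFrame(0, index=col, columns=col, dtype=int)
    # For each node, set the columns of its neighbors to 1 in its row.
    for node, neighbors in dic["first"].items():
        data.loc[node, neighbors] = 1
    print(data)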
    

    The output of this code is

       a  b  c  d  e  f  g  h
    a  0  1  0  1  0  0  0  0
    b  1  0  0  0  0  0  0  1
    c  0  0  0  1  0  0  0  0
    d  0  0  1  0  0  0  1  0
    e  0  0  0  0  0  1  0  0
    f  0  0  0  1  1  0  0  0
    g  1  0  0  0  0  0  0  1
    h  0  1  0  0  0  0  1  0
    

    Note that this adjacency matrix is not symmetric, because I made up some random data (for example, "a" lists "d" as a neighbor, but "d" does not list "a").
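For the first point of the question (a label set larger than the keys of nbrtree), one possible approach is to build the frame from the full list of names and then fill in only the pairs that nbrtree mentions. The sketch below is a minimal illustration under assumptions: `all_proteins` is a hypothetical stand-in for the full 100-name list, and the `nbrtree` shown is just an excerpt of the question's data. The last step symmetrizes the matrix, which may be useful when a neighbor appears in a list but has no entry of its own.

```python
import pandas as pd

# Hypothetical full label set (the question has about 100 names).
all_proteins = ["bL31", "uL5", "bL32", "uL22", "bL17", "uL13", "bL20", "bL21"]

# Excerpt of the neighbor lists from the question.
nbrtree = {
    "bL31": ["uL5"],
    "uL5": ["bL31"],
    "bL32": ["uL22"],
    "uL22": ["bL32", "bL17"],
    "uL13": ["bL20"],
    "bL20": ["uL13", "bL21"],
}

# Zero matrix over the full label set, labeled on both axes.
adj = pd.DataFrame(0, index=all_proteins, columns=all_proteins, dtype=int)

# Mark each listed neighbor in the row of its protein.
for protein, neighbors in nbrtree.items():
    adj.loc[protein, neighbors] = 1

# Symmetrize: an edge recorded in either direction counts in both.
adj = (adj + adj.T).clip(upper=1)
```

Here "bL17" and "bL21" never appear as keys, yet they still get a row and a column because the labels come from `all_proteins` rather than from the dictionary.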

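    The index and columns of the dataframe above already carry the labels. If you also want the labels to appear as cells inside the table, change the construction to the following (object dtype lets the string labels sit next to the integer entries)

    data = pd.DataFrame(0, index = ['index']+col, columns = ['column']+col, dtype = object)
    data.loc['index'] = [0]+col
    data.loc[:, 'column'] = ['*']+col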
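For the second point of the question (names that follow the rows and columns when they are moved), the index and column labels of a DataFrame already travel with the data, so selecting with `.loc` in a new order permutes both axes together. The sketch below is only an illustration: the small matrix and the `order` list are made-up examples, not the question's data.

```python
import pandas as pd

labels = ["a", "b", "c", "d"]
adj = pd.DataFrame(
    [[0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0],
     [0, 1, 0, 0]],
    index=labels,
    columns=labels,
)

# Hypothetical ordering that puts connected nodes next to each other
# (a is linked to c, b is linked to d).
order = ["a", "c", "b", "d"]

# Apply the same permutation to rows and columns; labels move with the data.
blocked = adj.loc[order, order]
print(blocked)
```

With this ordering the 1s fall into two 2x2 blocks along the diagonal. One way to obtain such an ordering for real data might be to group the labels by connected component, for example with `scipy.sparse.csgraph.connected_components`.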