I have never used pandas or numpy for this purpose before and am wondering what's the idiomatic way to construct labeled adjacency matrices in pandas.
My data comes in a shape similar to the example below. Each "uL22"-style key is a protein, and the arrays are the neighbors of that protein. Hence (in the example below) an adjacency matrix would have 1s in the bL31 row, uL5 column, and the converse, etc.
My problem is twofold:
The actual dimension of the adjacency matrix is dictated by a set of protein names that is generally much larger than the set contained in nbrtree, so I'm wondering what's the best way to map my nbrtree data to that set, say a 100 by 100 matrix corresponding to the neighborhood relationships of 100 proteins.
I'm not quite sure how to "bind" the names (i.e. uL32 etc.) of those 100 proteins to the rows and columns of this matrix such that when I start moving rows around, the names move accordingly. (I'm planning to rearrange the adjacency matrix to have a block-diagonal structure.)
"nbrtree": {
"bL31": ["uL5"],
"uL5": ["bL31"],
"bL32": ["uL22"],
"uL22": ["bL32","bL17"],
...
"bL33": ["bL35"],
"bL35": ["bL33","uL15"],
"uL13": ["bL20"],
"bL20": ["uL13","bL21"]
}
>>> len(nbrtree)
40
I'm sure this is a manipulation that people perform daily; I'm just not quite familiar with how dataframes work, so I'm probably looking for something very obvious. Thank you so much!
I don't fully understand your question, but from what I gather, try out this code.
from pprint import pprint as pp
import pandas as pd
dic = {"first": {
"a": ["b","d"],
"b": ["a","h"],
"c": ["d"],
"d": ["c","g"],
"e": ["f"],
"f": ["e","d"],
"g": ["h","a"],
"h": ["g","b"]
}}
col = list(dic['first'].keys())
data = pd.DataFrame(0, index = col, columns = col, dtype = int)
for x, y in dic['first'].items():
    # y is the list of neighbors of x, so every listed column in row x is set to 1
    data.loc[x, y] = 1
pp(data)
The output from this code is:
a b c d e f g h
a 0 1 0 1 0 0 0 0
b 1 0 0 0 0 0 0 1
c 0 0 0 1 0 0 0 0
d 0 0 1 0 0 0 1 0
e 0 0 0 0 0 1 0 0
f 0 0 0 1 1 0 0 0
g 1 0 0 0 0 0 0 1
h 0 1 0 0 0 0 1 0
Note that this adjacency matrix is not symmetric, as I have used some random data.
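If the matrix has to cover a larger set of names than the keys of the dictionary, one option might be to build the frame from the full name list and then fill it in from the dictionary. This is only a minimal sketch: `all_names` is a hypothetical stand-in for your full set of protein names, and the small `nbrtree` here is made-up sample data in the shape shown in the question. It also writes each pair in both directions, which assumes the neighbor relation is meant to be symmetric.

```python
import pandas as pd

# Hypothetical data: `all_names` stands in for the full set of protein names,
# `nbrtree` for the (smaller) neighbor dictionary from the question.
all_names = ["bL31", "uL5", "bL32", "uL22", "bL17", "uL2"]
nbrtree = {
    "bL31": ["uL5"],
    "uL5": ["bL31"],
    "bL32": ["uL22"],
    "uL22": ["bL32", "bL17"],
}

# Start from an all-zero square frame labeled by the full name set.
adj = pd.DataFrame(0, index=all_names, columns=all_names, dtype=int)

for protein, neighbors in nbrtree.items():
    for nbr in neighbors:
        # Assumption: the relation is symmetric, so mark both directions.
        adj.loc[protein, nbr] = 1
        adj.loc[nbr, protein] = 1

print(adj)
```

Names that never appear in the dictionary (here `uL2`) simply keep an all-zero row and column.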
To absorb your labels into the dataframe itself, change the code to the following:
# dtype=object so the cells can hold the string labels as well as the 0/1 entries
data = pd.DataFrame(0, index = ['index']+col, columns = ['column']+col, dtype = object)
data.loc['index'] = [0]+col
data.loc[:, 'column'] = ['*']+col
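That said, the labels already live in the frame's index and columns, so they should travel with the data when rows and columns are reordered by label. A small sketch of that idea, where the 4-node matrix and the ordering `new_order` are made-up for illustration (how to find a good ordering for a block-diagonal layout is a separate problem):

```python
import pandas as pd

names = ["a", "b", "c", "d"]
adj = pd.DataFrame(
    [[0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0],
     [0, 1, 0, 0]],
    index=names,
    columns=names,
)

# Hypothetical ordering that places connected nodes next to each other.
new_order = ["a", "c", "b", "d"]

# Selecting by label applies the same permutation to rows and columns,
# and the labels move together with the values.
reordered = adj.loc[new_order, new_order]

print(reordered)
```

With this ordering, the a/c pair and the b/d pair end up as two 2x2 blocks on the diagonal.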