Search code examples
pythonpandasdataframemulti-index

Multiindex data.frame from two data.frames join by column headers


There are a dozens similar sounding questions here, I think I've searched them all and could not find a solution to my problem:

I have 2 df: df_c:

        CAN-01  CAN-02  CAN-03
CE                            
ce1     0.84     0.73     0.50
ce2     0.06     0.13     0.05

And df_z:

        CAN-01  CAN-02  CAN-03
marker                        
cell1     0.29     1.5       7
cell2     1.00     3.0       1

I want to join for each 'marker' + 'CE' combination over their column names

Example: cell1 + ce1:

[[0.29, 0.84],[1.5,0.73],[7,0.5], ...]

(Continuing for cell1 + ce2, cell2 + ce1, cell2 + ce2)

I have a working example using two loops and .loc twice, but it takes forever on the full data set.

I think the best to build is a multiindex DF with some merge/join/concat magic:

                CAN-01  CAN-02  CAN-03
     Source                            
0    CE           0.84    0.73    0.50
     Marker       0.29    1.5     7
1    CE           ...     
     Marker       ...     

Sample Code

dc = [['ce1', 0.84, 0.73, 0.5],['c2', 0.06,0.13,0.05]] 
dat_c = pd.DataFrame(dc, columns=['CE', 'CAN-01', 'CAN-02', 'CAN-03']) 
dat_c.set_index('CE',inplace=True)

dz = [['cell1', 0.29, 1.5, 7],['cell2', 1, 3, 1]]
dat_z = pd.DataFrame(dz, columns=['marker', "CAN-01", "CAN-02", "CAN-03"])
dat_z.set_index('marker',inplace=True)

Bad/Slow Solution

for ci, c_row in dat_c.iterrows(): # for each CE in CE table

    tmp = []

    for j,colz in enumerate(dat_z.columns[1:]):

        if not colz in dat_c:
            continue
        entry_c = c_row.loc[colz]
    
        if len(entry_c.shape) > 0:
            continue
        
        tmp.append([dat_z.loc[marker,colz],entry_c])

Solution

  • IIUC:

    use append()+groupby():

    dat_c.index=[f"cell{x+1}" for x in range(len(dat_c))]
    df=dat_c.append(dat_z).groupby(level=0).agg(list)
    

    output of df:

             CAN-01         CAN-02          CAN-03
    cell1   [0.84, 0.29]    [0.73, 1.5]     [0.5, 7.0]
    cell2   [0.06, 1.0]     [0.13, 3.0]     [0.05, 1.0]
    

    If needed list:

    dat_c.index=[f"cell{x+1}" for x in range(len(dat_c))]
    lst=dat_c.append(dat_z).groupby(level=0).agg(list).to_numpy().tolist()
    

    output of lst:

    [[[0.84, 0.29], [0.73, 1.5], [0.5, 7.0]],
     [[0.06, 1.0], [0.13, 3.0], [0.05, 1.0]]]