Search code examples
pythonpandasdataframe

Retain empty lists when unnesting columns in pandas


Background

I have the following dataframe

import pandas as pd
df = pd.DataFrame({'Result' : [['pos', '+', 'pos', 'positive'], ['neg', 'neg'], [], ['pos']], 
                  'P_ID': [1,2,3,4], 
                  'Gene' : [['kras', 'kras', 'kras', 'egfr'], ['brca', 'brca'], [], ['cd133']],
                  'N_ID' : ['A1', 'A2', 'A3', 'A4']

                 })

#rearrange columns
df = df[['P_ID', 'N_ID', 'Gene', 'Result']]
df


  P_ID N_ID         Gene                         Result
0   1   A1  [kras, kras, kras, egfr]    [pos, +, pos, positive]
1   2   A2  [brca, brca]                [neg, neg]
2   3   A3  []                          []
3   4   A4  [cd133]                     [pos]

And I use the following code taken from unnest (explode) multiple list 2.0

df.set_index('P_ID').apply(lambda x: x.apply(pd.Series).stack()).ffill().reset_index().drop('level_1', 1)

Problem

Using the code above, I get close to what I want. However, because the Gene and Result columns in the third row 2 are empty lists [], I get the output of the row above it as seen below:

 P_ID   N_ID Gene   Result
0   1   A1  kras    pos
1   1   A1  kras    +
2   1   A1  kras    pos
3   1   A1  egfr    positive
4   2   A2  brca    neg
5   2   A2  brca    neg
6   3   A3  brca    neg
7   4   A4  cd133   pos

Instead, I would like to get the output below, where the following row 6 3 A3 [] [] reflects the original dataframe df, which contained empty lists

Desired Output

   P_ID N_ID Gene   Result
0   1   A1  kras    pos
1   1   A1  kras    +
2   1   A1  kras    pos
3   1   A1  egfr    positive
4   2   A2  brca    neg
5   2   A2  brca    neg
6   3   A3  []      []
7   4   A4  cd133   pos

Question

How do I get the desired output?


Solution

  • Let's try some stacking and unstacking magic. This preserves empty lists as well.

    (df.set_index(['P_ID', 'N_ID'])
       .stack()
       .str.join(',')
       .str.split(',', expand=True)
       .stack()
       .unstack(-2)
       .reset_index(level=[0, 1])
       .reset_index(drop=True))
    
       P_ID N_ID    Result   Gene
    0  1     A1   pos       kras 
    1  1     A1   +         kras 
    2  1     A1   pos       kras 
    3  1     A1   positive  egfr 
    4  2     A2   neg       brca 
    5  2     A2   neg       brca 
    6  3     A3                  
    7  4     A4   pos       cd133
    

    Details

    First, set the columns that are not to be touched as the index.

    df.set_index(['P_ID', 'N_ID'])
    
                                Result                      Gene
    P_ID N_ID                                                   
    1    A1    [pos, +, pos, positive]  [kras, kras, kras, egfr]
    2    A2    [neg, neg]               [brca, brca]            
    3    A3    []                       []                      
    4    A4    [pos]                    [cd133]                 
    

    Next, stack the rows.

    _.stack()
    
    P_ID  N_ID        
    1     A1    Result    [pos, +, pos, positive] 
                Gene      [kras, kras, kras, egfr]
    2     A2    Result    [neg, neg]              
                Gene      [brca, brca]            
    3     A3    Result    []                      
                Gene      []                      
    4     A4    Result    [pos]                   
                Gene      [cd133]                 
    dtype: object
    

    We have a series now. We need to explode these elements into separate columns. So, first join the lists, then split again. This works assuming your lists elements don't contain commas themselves (if not, find another separator to join and split on).

    _.str.join(',').str.split(',', expand=True)
    
                          0     1     2         3
    P_ID N_ID                                    
    1    A1   Result  pos    +     pos   positive
              Gene    kras   kras  kras  egfr    
    2    A2   Result  neg    neg   None  None    
              Gene    brca   brca  None  None    
    3    A3   Result         None  None  None    
              Gene           None  None  None    
    4    A4   Result  pos    None  None  None    
              Gene    cd133  None  None  None    
    

    We need to get rid of NULL values, so call stack again.

    _.stack()
    
    P_ID  N_ID           
    1     A1    Result  0    pos     
                        1    +       
                        2    pos     
                        3    positive
                Gene    0    kras    
                        1    kras    
                        2    kras    
                        3    egfr    
    2     A2    Result  0    neg     
                        1    neg     
                Gene    0    brca    
                        1    brca    
    3     A3    Result  0            
                Gene    0            
    4     A4    Result  0    pos     
                Gene    0    cd133   
    dtype: object
    

    We're almost there. Now we want the second last level of the index to become our columns, so unstack using unstack(-2) (unstack on the second last level)

    _.unstack(-2)
    
                   Result   Gene
    P_ID N_ID                   
    1    A1   0  pos       kras 
              1  +         kras 
              2  pos       kras 
              3  positive  egfr 
    2    A2   0  neg       brca 
              1  neg       brca 
    3    A3   0                 
    4    A4   0  pos       cd133
    

    Lastly, some housekeeping to obtain our original columns.

    _.reset_index(-1, drop=True).reset_index()
    
       P_ID N_ID    Result   Gene
    0  1     A1   pos       kras 
    1  1     A1   +         kras 
    2  1     A1   pos       kras 
    3  1     A1   positive  egfr 
    4  2     A2   neg       brca 
    5  2     A2   neg       brca 
    6  3     A3                  
    7  4     A4   pos       cd133
    

    If you want the blanks to actually be lists, use applymap:

    _.applymap(lambda x: x if x != '' else []))
    
       P_ID N_ID    Result   Gene
    0  1     A1   pos       kras 
    1  1     A1   +         kras 
    2  1     A1   pos       kras 
    3  1     A1   positive  egfr 
    4  2     A2   neg       brca 
    5  2     A2   neg       brca 
    6  3     A3   []        []   
    7  4     A4   pos       cd133