Search code examples
pythonpandasdataframemergedummy-variable

Compare two dataframe and transpose each value as column by filling binary when there is a match?


I have two dataframe as follows:

df1

SYMBOL  seqnames    start      end    SampleID
SPATA21   1       16736303  16736303    eAPD114
E2F2      1       23836607  23836607    eAPD114
FCN3      1       27701288  27701288    eAPD120
MARCKSL   1       32800671  32800671    KAPD144
MARCKSL   1       32800671  32800671    eAPD184
LRRC40    1       70644607  70644607    eAPD184
KREMEN1  22       29536275  29536275    eAPD005
KIF14     1       200569584 200569584   eAPD081
RGS7BP    5       63802465  63802465    YAPD025
PCDHB6    5       140531231 140531231   YAPD025
SERPINB4 18       61305310  61305310    eAPD081

df2

SYMBOL  seqnames    start      end
SPATA21   1       16736303  16736303
E2F2      1       23836607  23836607
FCN3      1       27701288  27701288
MARCKSL   1       32800671  32800671
LRRC40    1       70644607  70644607
KREMEN1  22       29536275  29536275
SERPINB4 18       61305310  61305310
SERPINB4 21       61305310  61305310

I want to map df1 to df2 and represent each SampleID as separate column with filling 0 and 1 if there is a match between df1 and df2:

Expected output:

SYMBOL  seqnames    start      end      eAPD114 eAPD120 KAPD144 eAPD184 eAPD005 eAPD081 YAPD025
SPATA21   1       16736303  16736303       1       0       0       0       0       0       0
E2F2      1       23836607  23836607       1       0       0       0       0       0       0
FCN3      1       27701288  27701288       0       1       0       0       0       0       0
MARCKSL   1       32800671  32800671       0       0       1       1       0       0       0
LRRC40    1       70644607  70644607       0       0       0       1       0       0       0
KREMEN1  22       29536275  29536275       0       0       0       0       1       0       0
SERPINB4 18       61305310  61305310       0       0       0       0       0       1       0
SERPINB4 21       61305310  61305310       0       0       0       0       0       0       0

I tried using the pivot method mentioned here . But did not work efficiently


Solution

  • Use:

    cols = ['SYMBOL','seqnames','start','end']
    #left join between both DataFrames
    df = df2.merge(df1, on=cols, how='left')
    #convert column SampleID to indicators,get max and last add missing df1['SampleID']
    df = (df.join(pd.get_dummies(df.pop('SampleID')))
            .groupby(cols).max()
            .reindex(df1['SampleID'].unique(), axis=1, fill_value=0)
            .reset_index())
    print (df)
         SYMBOL  seqnames     start       end  eAPD114  eAPD120  KAPD144  eAPD184  \
    0      E2F2         1  23836607  23836607        1        0        0        0   
    1      FCN3         1  27701288  27701288        0        1        0        0   
    2   KREMEN1        22  29536275  29536275        0        0        0        0   
    3    LRRC40         1  70644607  70644607        0        0        0        1   
    4   MARCKSL         1  32800671  32800671        0        0        1        1   
    5  SERPINB4        18  61305310  61305310        0        0        0        0   
    6  SERPINB4        21  61305310  61305310        0        0        0        0   
    7   SPATA21         1  16736303  16736303        1        0        0        0   
    
       eAPD005  eAPD081  YAPD025  
    0        0        0        0  
    1        0        0        0  
    2        1        0        0  
    3        0        0        0  
    4        0        0        0  
    5        0        1        0  
    6        0        0        0  
    7        0        0        0