Search code examples
pandassubtraction

Substract Values within column dependent on column groups


I have the following dataframe with roughly 5000 entries.

df = pd.DataFrame({'name':['Strain A', 'Strain B', 'Strain C', 'Strain A', 'Strain B', 'Strain C','Strain A', 'Strain B', 'Strain C','Strain A', 'Strain B', 'Strain C', 'Strain A', 'Strain B', 'Strain C','Strain A', 'Strain B',],
               'Buffer':['PBS', 'PBS', 'PBS', 'Tris', 'Tris', 'Tris', 'Phos','Phos','Phos','PBS', 'PBS', 'PBS', 'Tris', 'Tris', 'Tris', 'Phos','Phos'],
               'Time' :[2,2,2,2,2,2,2,2,2,20,20,20,20,20,20,20,20],
               '%':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17]
               }).set_index('name')



        Buffer  Time   %
name                     
Strain A  PBS    2     1 
Strain B  PBS    2     2 
Strain C  PBS    2     3 
Strain A  Tris   2     4 
Strain B  Tris   2     5 
Strain C  Tris   2     6 
Strain A  Phos   2     7 
Strain B  Phos   2     8 
Strain C  Phos   2     9 
Strain A  PBS    20    10
Strain B  PBS    20    11
Strain C  PBS    20    12
Strain A  Tris   20    13
Strain B  Tris   20    14
Strain C  Tris   20    15
Strain A  Phos   20    16
Strain B  Phos   20    17

I want to create a new df in which I subtract the % value after 20 h time points of the % from the 2h timepoints. I have tried several solutions, that usually end up in NaN I would like to end up with a df similar to this:

         Buffer  %
name              
Strain A  PBS    9
Strain B  PBS    9
Strain C  PBS    9
Strain A  Tris   9
Strain B  Tris   9
Strain C  Tris   9
Strain A  Phos   9
Strain B  Phos   9
Strain C  Phos  -9

Thanks for the help!


Solution

  • I created a unique identifier for the two different 'stacks' in your table, then used that identifier as a merge index. Pandas will add a suffix to keep your names unique. Then you can perform an easy subtraction:

    df = pd.DataFrame({'name':['Strain A', 'Strain B', 'Strain C', 'Strain A', 'Strain B', 'Strain C','Strain A', 'Strain B', 'Strain C','Strain A', 'Strain B', 'Strain C', 'Strain A', 'Strain B', 'Strain C','Strain A', 'Strain B',],
               'Buffer':['PBS', 'PBS', 'PBS', 'Tris', 'Tris', 'Tris', 'Phos','Phos','Phos','PBS', 'PBS', 'PBS', 'Tris', 'Tris', 'Tris', 'Phos','Phos'],
               'Time' :[2,2,2,2,2,2,2,2,2,20,20,20,20,20,20,20,20],
               '%':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17]
               }).set_index('name')
    
    
    df["unique"] = df.index + "_" + df["Buffer"]
    df1 = df[df["Time"]==2]
    df2 = df[df["Time"]==20]
    
    df3 = pd.merge(df1,df2,left_on='unique',right_on='unique',how='inner')
    df3["Diff"] = df3["%_y"] - df3["%_x"]
    

    There's probably a shorter cleverer way, but this is probably more instructive.

    EDIT:

    To account for missing data, you could choose how='outer', which will fill NaN's where there are missing values, then use

    df3 = df3.fillna(0.0)
    

    before the subtraction