I have the following dataframe with roughly 5000 entries.
df = pd.DataFrame({'name':['Strain A', 'Strain B', 'Strain C', 'Strain A', 'Strain B', 'Strain C','Strain A', 'Strain B', 'Strain C','Strain A', 'Strain B', 'Strain C', 'Strain A', 'Strain B', 'Strain C','Strain A', 'Strain B',],
'Buffer':['PBS', 'PBS', 'PBS', 'Tris', 'Tris', 'Tris', 'Phos','Phos','Phos','PBS', 'PBS', 'PBS', 'Tris', 'Tris', 'Tris', 'Phos','Phos'],
'Time' :[2,2,2,2,2,2,2,2,2,20,20,20,20,20,20,20,20],
'%':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17]
}).set_index('name')
Buffer Time %
name
Strain A PBS 2 1
Strain B PBS 2 2
Strain C PBS 2 3
Strain A Tris 2 4
Strain B Tris 2 5
Strain C Tris 2 6
Strain A Phos 2 7
Strain B Phos 2 8
Strain C Phos 2 9
Strain A PBS 20 10
Strain B PBS 20 11
Strain C PBS 20 12
Strain A Tris 20 13
Strain B Tris 20 14
Strain C Tris 20 15
Strain A Phos 20 16
Strain B Phos 20 17
I want to create a new df in which I subtract the % value after 20 h time points of the % from the 2h timepoints. I have tried several solutions, that usually end up in NaN I would like to end up with a df similar to this:
Buffer %
name
Strain A PBS 9
Strain B PBS 9
Strain C PBS 9
Strain A Tris 9
Strain B Tris 9
Strain C Tris 9
Strain A Phos 9
Strain B Phos 9
Strain C Phos -9
Thanks for the help!
I created a unique identifier for the two different 'stacks' in your table, then used that identifier as a merge
index. Pandas will add a suffix to keep your names unique. Then you can perform an easy subtraction:
df = pd.DataFrame({'name':['Strain A', 'Strain B', 'Strain C', 'Strain A', 'Strain B', 'Strain C','Strain A', 'Strain B', 'Strain C','Strain A', 'Strain B', 'Strain C', 'Strain A', 'Strain B', 'Strain C','Strain A', 'Strain B',],
'Buffer':['PBS', 'PBS', 'PBS', 'Tris', 'Tris', 'Tris', 'Phos','Phos','Phos','PBS', 'PBS', 'PBS', 'Tris', 'Tris', 'Tris', 'Phos','Phos'],
'Time' :[2,2,2,2,2,2,2,2,2,20,20,20,20,20,20,20,20],
'%':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17]
}).set_index('name')
df["unique"] = df.index + "_" + df["Buffer"]
df1 = df[df["Time"]==2]
df2 = df[df["Time"]==20]
df3 = pd.merge(df1,df2,left_on='unique',right_on='unique',how='inner')
df3["Diff"] = df3["%_y"] - df3["%_x"]
There's probably a shorter cleverer way, but this is probably more instructive.
To account for missing data, you could choose how='outer'
, which will fill NaN's where there are missing values, then use
df3 = df3.fillna(0.0)
before the subtraction