I'm new to working with pandas and I am trying to add a column with groupby() based off of a Spotfire calculated column formula
Suppose I had a table with the following data (df1):
'Well ID','Assay','Source','Treat','BkgrdSub Fluorescence','Calced'
'A1',4,'Source 1','OPA',-215.75,0.035583351
'A2',4,'Source 1','OPA',-160.75,0.130472288
'A3',4,'Source 1','OPA',343.25,1
'H10',6,'Source 1','OPP',9896,1
'H11',6,'Source 1','OPP',9892,0.999605226
'H12',6,'Source 1','CN',-1,1
'A1',4,'Source 2','OPA',-170,0.03682641
'A2',4,'Source 2','OPA',-86,0.083431583
'A3',4,'Source 2','OPA',1566,1
'H10',6,'Source 2','ZI',4885,0.809271732
'H11',6,'Source 2','ZI',6092,1
'H12',6,'Source 2','CN',78,1
'A1',4,'Source 3','OPA',-114.5,0.037329147
'A2',4,'Source 3','OPA',-114.5,0.037329147
'A3',4,'Source 3','OPA',3028.5,1
'H10',6,'Source 3','ZIII',4245.375,0.85305734
'H11',6,'Source 3','ZIII',5017.375,1
'H12',6,'Source 3','CN',20.375,1
'A1',4,'Source 4','OPA',-183.375,0.017731683
'A2',4,'Source 4','OPA',-102.375,0.044831047
'A3',4,'Source 4','OPA',2752.625,1
'H10',6,'Source 4','ZIIII',2635.75,0.697943562
'H11',6,'Source 4','ZIIII',3878.75,1
'H12',6,'Source 4','CN',-10.25,1
'A1',4,'Source 5','OPA',-236.375,0
'A2',4,'Source 5','OPA',-199.375,0.028094153
'A3',4,'Source 5','OPA',1080.625,1
'H10',6,'Source 5','ZV',3489,0.952202946
'H11',6,'Source 5','ZV',3676,1
'H12',6,'Source 5','CN',31,1
'A1',4,'Source 6','OPA',-221.375,0.008870491
'A2',4,'Source 6','OPA',-150.375,0.050857481
'A3',4,'Source 6','OPA',1454.625,1
'H10',6,'Source 6','ZVI',2224.375,1
'H11',6,'Source 6','ZVI',1418.375,0.672457584
'H12',6,'Source 6','CN',716.375,1
I want to be able to add one calculated column that is defined by the Spotfire equation:
([BkgrdSub Fluorescence] - Min([BkgrdSub Fluorescence])) / Max([BkgrdSub Fluorescence] - Min([BkgrdSub Fluorescence])) OVER ([Treat],[Source],[Assay])
I was building the script one step at a time and I got to the point where I tried to run this with groupby():
import pandas as pd
df1.insert(5,"Scaled BckgrdSub Fluorescence min","")
df1['Scaled BckgrdSub Fluorescence min'] = df1.groupby(['Treat','Source','Assay'])['BkgrdSub Fluorescence'].transform('min')
df1.insert(6,"Scaled BckgrdSub Fluorescence eq","")
df1['Scaled BckgrdSub Fluorescence eq'] = df1[['BkgrdSub Fluorescence'] - ['Scaled BckgrdSub Fluorescence min']].groupby(df1['Treat'],df1['Source'],df1['Assay']).transform('max')
But I get the error:
TypeError: unsupported operand type(s) for -: 'list' and 'list'
Which I understand means that I can't subtract a list from a list. So obviously the syntax doesn't support equations within the groupby() function.
I also tried to avoid this syntax error by avoiding using groupby() with 'Scaled BckgrdSub Fluorescence' being the desired outcome column:
df1.insert(5,"Scaled BckgrdSub Fluorescence min","")
df1['Scaled BckgrdSub Fluorescence min'] = df1.groupby(['Treat','Source','Assay'])['BkgrdSub Fluorescence'].transform('min')
df1.insert(6,"Scaled BckgrdSub Fluorescence eq","")
df1['Scaled BckgrdSub Fluorescence eq'] = df1['BkgrdSub Fluorescence'] - df1['Scaled BckgrdSub Fluorescence min']
df1.insert(7,"Scaled BckgrdSub Fluorescence max","")
df1['Scaled BckgrdSub Fluorescence max'] = df1.groupby(['Treat','Source','Assay'])['Scaled BckgrdSub Fluorescence eq'].transform('max')
df1.insert(8,"Scaled BckgrdSub Fluorescence","")
df1['Scaled BckgrdSub Fluorescence'] = df1['Scaled BckgrdSub Fluorescence eq'] / df1['Scaled BckgrdSub Fluorescence max']
However, this isn't the same outcome as the calculated column you get in Spotfire.
The expected output, taken from Spotfire, for the calculated column is the already shown in the 'Calced' column.
So my question is, is there a simple way to add the column that I want with the groupby() function in a few lines while remaining accurate?
Here is one way to duplicate your target numbers:
df1.groupby('Source').apply(lambda x: ((x['BkgrdSub Fluorescence'] - df1['BkgrdSub Fluorescence'].min()) / (x['BkgrdSub Fluorescence'].max() - df1['BkgrdSub Fluorescence'].min())))
Result:
Source1 0 0.035583
1 0.130472
2 1.000000
Source2 3 0.036826
4 0.083432
5 1.000000
Source3 6 0.037329
7 0.037329
8 1.000000
Source4 9 0.017732
10 0.044831
11 1.000000
Source5 12 0.000000
13 0.028094
14 1.000000
Source6 15 0.008870
16 0.050857
17 1.000000
Updated for new dataset:
df1.index.name = 'ID' # will be used for sorting
# group and apply equation
grouped = df1.groupby(['Treat','Source','Assay']).apply(lambda x: ((x['BkgrdSub Fluorescence']
- df1['BkgrdSub Fluorescence'].min()) / (x['BkgrdSub Fluorescence'].max()
- df1['BkgrdSub Fluorescence'].min())))
# remove multi-index from grouped, sort on ID, and insert into df1
df1['Calculated'] = grouped.sort_index(level='ID').reset_index(drop=True)
New result (see Calculated column):
Well ID Assay Source Treat BkgrdSub Fluorescence Calced Calculated
ID
0 A1 4 Source 1 OPA -215.750 0.035583 0.035583
1 A2 4 Source 1 OPA -160.750 0.130472 0.130472
2 A3 4 Source 1 OPA 343.250 1.000000 1.000000
3 H10 6 Source 1 OPP 9896.000 1.000000 1.000000
4 H11 6 Source 1 OPP 9892.000 0.999605 0.999605
5 H12 6 Source 1 CN -1.000 1.000000 1.000000
6 A1 4 Source 2 OPA -170.000 0.036826 0.036826
7 A2 4 Source 2 OPA -86.000 0.083432 0.083432
8 A3 4 Source 2 OPA 1566.000 1.000000 1.000000
9 H10 6 Source 2 ZI 4885.000 0.809272 0.809272
10 H11 6 Source 2 ZI 6092.000 1.000000 1.000000
11 H12 6 Source 2 CN 78.000 1.000000 1.000000
12 A1 4 Source 3 OPA -114.500 0.037329 0.037329
13 A2 4 Source 3 OPA -114.500 0.037329 0.037329
14 A3 4 Source 3 OPA 3028.500 1.000000 1.000000
15 H10 6 Source 3 ZIII 4245.375 0.853057 0.853057
16 H11 6 Source 3 ZIII 5017.375 1.000000 1.000000
17 H12 6 Source 3 CN 20.375 1.000000 1.000000
18 A1 4 Source 4 OPA -183.375 0.017732 0.017732
19 A2 4 Source 4 OPA -102.375 0.044831 0.044831
20 A3 4 Source 4 OPA 2752.625 1.000000 1.000000
21 H10 6 Source 4 ZIIII 2635.750 0.697944 0.697944
22 H11 6 Source 4 ZIIII 3878.750 1.000000 1.000000
23 H12 6 Source 4 CN -10.250 1.000000 1.000000
24 A1 4 Source 5 OPA -236.375 0.000000 0.000000
25 A2 4 Source 5 OPA -199.375 0.028094 0.028094
26 A3 4 Source 5 OPA 1080.625 1.000000 1.000000
27 H10 6 Source 5 ZV 3489.000 0.952203 0.952203
28 H11 6 Source 5 ZV 3676.000 1.000000 1.000000
29 H12 6 Source 5 CN 31.000 1.000000 1.000000
30 A1 4 Source 6 OPA -221.375 0.008870 0.008870
31 A2 4 Source 6 OPA -150.375 0.050857 0.050857
32 A3 4 Source 6 OPA 1454.625 1.000000 1.000000
33 H10 6 Source 6 ZVI 2224.375 1.000000 1.000000
34 H11 6 Source 6 ZVI 1418.375 0.672458 0.672458
35 H12 6 Source 6 CN 716.375 1.000000 1.000000