Search code examples
pythonpandasdataframepandas-groupbyspotfire

Adding a calculated column of an equation using pandas groupby()


I'm new to working with pandas and I am trying to add a column with groupby() based off of a Spotfire calculated column formula

Suppose I had a table with the following data (df1):

'Well ID','Assay','Source','Treat','BkgrdSub Fluorescence','Calced'
'A1',4,'Source 1','OPA',-215.75,0.035583351
'A2',4,'Source 1','OPA',-160.75,0.130472288
'A3',4,'Source 1','OPA',343.25,1
'H10',6,'Source 1','OPP',9896,1
'H11',6,'Source 1','OPP',9892,0.999605226
'H12',6,'Source 1','CN',-1,1
'A1',4,'Source 2','OPA',-170,0.03682641
'A2',4,'Source 2','OPA',-86,0.083431583
'A3',4,'Source 2','OPA',1566,1
'H10',6,'Source 2','ZI',4885,0.809271732
'H11',6,'Source 2','ZI',6092,1
'H12',6,'Source 2','CN',78,1
'A1',4,'Source 3','OPA',-114.5,0.037329147
'A2',4,'Source 3','OPA',-114.5,0.037329147
'A3',4,'Source 3','OPA',3028.5,1
'H10',6,'Source 3','ZIII',4245.375,0.85305734
'H11',6,'Source 3','ZIII',5017.375,1
'H12',6,'Source 3','CN',20.375,1
'A1',4,'Source 4','OPA',-183.375,0.017731683
'A2',4,'Source 4','OPA',-102.375,0.044831047
'A3',4,'Source 4','OPA',2752.625,1
'H10',6,'Source 4','ZIIII',2635.75,0.697943562
'H11',6,'Source 4','ZIIII',3878.75,1
'H12',6,'Source 4','CN',-10.25,1
'A1',4,'Source 5','OPA',-236.375,0
'A2',4,'Source 5','OPA',-199.375,0.028094153
'A3',4,'Source 5','OPA',1080.625,1
'H10',6,'Source 5','ZV',3489,0.952202946
'H11',6,'Source 5','ZV',3676,1
'H12',6,'Source 5','CN',31,1
'A1',4,'Source 6','OPA',-221.375,0.008870491
'A2',4,'Source 6','OPA',-150.375,0.050857481
'A3',4,'Source 6','OPA',1454.625,1
'H10',6,'Source 6','ZVI',2224.375,1
'H11',6,'Source 6','ZVI',1418.375,0.672457584
'H12',6,'Source 6','CN',716.375,1

I want to be able to add one calculated column that is defined by the Spotfire equation:

([BkgrdSub Fluorescence] - Min([BkgrdSub Fluorescence])) / Max([BkgrdSub Fluorescence] - Min([BkgrdSub Fluorescence])) OVER ([Treat],[Source],[Assay])

I was building the script one step at a time and I got to the point where I tried to run this with groupby():

import pandas as pd

df1.insert(5,"Scaled BckgrdSub Fluorescence min","")
df1['Scaled BckgrdSub Fluorescence min'] = df1.groupby(['Treat','Source','Assay'])['BkgrdSub Fluorescence'].transform('min')
df1.insert(6,"Scaled BckgrdSub Fluorescence eq","")
df1['Scaled BckgrdSub Fluorescence eq'] = df1[['BkgrdSub Fluorescence'] - ['Scaled BckgrdSub Fluorescence min']].groupby(df1['Treat'],df1['Source'],df1['Assay']).transform('max')

But I get the error:

TypeError: unsupported operand type(s) for -: 'list' and 'list'

Which I understand means that I can't subtract a list from a list. So obviously the syntax doesn't support equations within the groupby() function.

I also tried to avoid this syntax error by avoiding using groupby() with 'Scaled BckgrdSub Fluorescence' being the desired outcome column:

df1.insert(5,"Scaled BckgrdSub Fluorescence min","")
df1['Scaled BckgrdSub Fluorescence min'] = df1.groupby(['Treat','Source','Assay'])['BkgrdSub Fluorescence'].transform('min')
df1.insert(6,"Scaled BckgrdSub Fluorescence eq","")
df1['Scaled BckgrdSub Fluorescence eq'] = df1['BkgrdSub Fluorescence'] - df1['Scaled BckgrdSub Fluorescence min']
df1.insert(7,"Scaled BckgrdSub Fluorescence max","")
df1['Scaled BckgrdSub Fluorescence max'] = df1.groupby(['Treat','Source','Assay'])['Scaled BckgrdSub Fluorescence eq'].transform('max')
df1.insert(8,"Scaled BckgrdSub Fluorescence","")
df1['Scaled BckgrdSub Fluorescence'] = df1['Scaled BckgrdSub Fluorescence eq'] / df1['Scaled BckgrdSub Fluorescence max']

However, this isn't the same outcome as the calculated column you get in Spotfire.

The expected output, taken from Spotfire, for the calculated column is the already shown in the 'Calced' column.

So my question is, is there a simple way to add the column that I want with the groupby() function in a few lines while remaining accurate?


Solution

  • Here is one way to duplicate your target numbers:

    df1.groupby('Source').apply(lambda x: ((x['BkgrdSub Fluorescence'] - df1['BkgrdSub Fluorescence'].min()) / (x['BkgrdSub Fluorescence'].max() - df1['BkgrdSub Fluorescence'].min())))
    

    Result:

    Source1  0     0.035583
             1     0.130472
             2     1.000000
    Source2  3     0.036826
             4     0.083432
             5     1.000000
    Source3  6     0.037329
             7     0.037329
             8     1.000000
    Source4  9     0.017732
             10    0.044831
             11    1.000000
    Source5  12    0.000000
             13    0.028094
             14    1.000000
    Source6  15    0.008870
             16    0.050857
             17    1.000000
    

    Updated for new dataset:

    df1.index.name = 'ID' # will be used for sorting
    # group and apply equation
    grouped = df1.groupby(['Treat','Source','Assay']).apply(lambda x: ((x['BkgrdSub Fluorescence'] 
          - df1['BkgrdSub Fluorescence'].min()) / (x['BkgrdSub Fluorescence'].max() 
          - df1['BkgrdSub Fluorescence'].min())))
    # remove multi-index from grouped, sort on ID, and insert into df1
    df1['Calculated'] = grouped.sort_index(level='ID').reset_index(drop=True)
    

    New result (see Calculated column):

       Well ID  Assay    Source  Treat  BkgrdSub Fluorescence    Calced   Calculated
    ID                                                                              
    0       A1      4  Source 1    OPA               -215.750  0.035583     0.035583
    1       A2      4  Source 1    OPA               -160.750  0.130472     0.130472
    2       A3      4  Source 1    OPA                343.250  1.000000     1.000000
    3      H10      6  Source 1    OPP               9896.000  1.000000     1.000000
    4      H11      6  Source 1    OPP               9892.000  0.999605     0.999605
    5      H12      6  Source 1     CN                 -1.000  1.000000     1.000000
    6       A1      4  Source 2    OPA               -170.000  0.036826     0.036826
    7       A2      4  Source 2    OPA                -86.000  0.083432     0.083432
    8       A3      4  Source 2    OPA               1566.000  1.000000     1.000000
    9      H10      6  Source 2     ZI               4885.000  0.809272     0.809272
    10     H11      6  Source 2     ZI               6092.000  1.000000     1.000000
    11     H12      6  Source 2     CN                 78.000  1.000000     1.000000
    12      A1      4  Source 3    OPA               -114.500  0.037329     0.037329
    13      A2      4  Source 3    OPA               -114.500  0.037329     0.037329
    14      A3      4  Source 3    OPA               3028.500  1.000000     1.000000
    15     H10      6  Source 3   ZIII               4245.375  0.853057     0.853057
    16     H11      6  Source 3   ZIII               5017.375  1.000000     1.000000
    17     H12      6  Source 3     CN                 20.375  1.000000     1.000000
    18      A1      4  Source 4    OPA               -183.375  0.017732     0.017732
    19      A2      4  Source 4    OPA               -102.375  0.044831     0.044831
    20      A3      4  Source 4    OPA               2752.625  1.000000     1.000000
    21     H10      6  Source 4  ZIIII               2635.750  0.697944     0.697944
    22     H11      6  Source 4  ZIIII               3878.750  1.000000     1.000000
    23     H12      6  Source 4     CN                -10.250  1.000000     1.000000
    24      A1      4  Source 5    OPA               -236.375  0.000000     0.000000
    25      A2      4  Source 5    OPA               -199.375  0.028094     0.028094
    26      A3      4  Source 5    OPA               1080.625  1.000000     1.000000
    27     H10      6  Source 5     ZV               3489.000  0.952203     0.952203
    28     H11      6  Source 5     ZV               3676.000  1.000000     1.000000
    29     H12      6  Source 5     CN                 31.000  1.000000     1.000000
    30      A1      4  Source 6    OPA               -221.375  0.008870     0.008870
    31      A2      4  Source 6    OPA               -150.375  0.050857     0.050857
    32      A3      4  Source 6    OPA               1454.625  1.000000     1.000000
    33     H10      6  Source 6    ZVI               2224.375  1.000000     1.000000
    34     H11      6  Source 6    ZVI               1418.375  0.672458     0.672458
    35     H12      6  Source 6     CN                716.375  1.000000     1.000000