Search code examples
pythonpandasdataframeanalytics

Subtract dataframe rows by a column value


I have an analysis result table

         Name  Analysis  Result                 Date    Type
0      Doe J.  Albumine    10.6   23.02.2021 8:07:22  before
1      Doe J.  Albumine     6.5   25.03.2021 8:08:09   after
2     Pine C.  Albumine    13.3   25.03.2021 9:17:54  before
3     Pine C.  Albumine    11.0   22.02.2021 9:25:54   after
4  Jackson D.  Albumine    14.2  23.02.2021 10:51:38  before
5   Jackson D  Albumine    12.2           23.03.2021   after
6  Schafer L.  Albumine     8.4  25.02.2021 10:39:39  before
7  Schafer L.  Albumine     9.3  25.03.2021 12:06:15   after

My goal is to calculate the difference between two analyses of each patient (these are all fictional) based on 'Type' column and obtain a following table:

     Name  Before  After  Difference
0  Doe j.    10.6    6.5         3.9

I've tried groupby but didn't succeed. Would appreciate any help.


Solution

  • Use DataFrame.pivot with subtract:

    df = df.pivot('Name','Type','Result').reset_index().rename_axis(columns=None)
    df['diff'] = df['before'].sub(df['after'])
    
    print (df)
             Name  after  before  diff
    0      Doe J.    6.5    10.6   4.1
    1  Jackson D.   12.2    14.2   2.0
    2     Pine C.   11.0    13.3   2.3
    3  Schafer L.    9.3     8.4  -0.9
    

    If error:

    ValueError: Index contains duplicate entries, cannot reshape

    it means there are duplicates, it means for same Name, Type are 2 or more values like:

    print (df)
             Name  Analysis  Result                 Date    Type
    0      Doe J.  Albumine    10.6   23.02.2021 8:07:22  before <- duplicate Doe J., before
    0      Doe J.  Albumine    10.6   23.02.2021 8:07:22  before <- duplicate Doe J., before
    1      Doe J.  Albumine     6.5   25.03.2021 8:08:09   after
    2     Pine C.  Albumine    13.3   25.03.2021 9:17:54  before
    3     Pine C.  Albumine    11.0   22.02.2021 9:25:54   after
    4  Jackson D.  Albumine    14.2  23.02.2021 10:51:38  before
    5  Jackson D.  Albumine    12.2           23.03.2021   after
    6  Schafer L.  Albumine     8.4  25.02.2021 10:39:39  before
    7  Schafer L.  Albumine     9.3  25.03.2021 12:06:15   after
    

    Possible solution with DataFrame.pivot_table and some aggregate function like mean, sum. If need first matched value use aggfunc='first'

    df = df.pivot_table(index='Name',columns='Type',values='Result', aggfunc='sum').reset_index().rename_axis(columns=None)
    df['diff'] = df['before'].sub(df['after'])
    
    print (df)
             Name  after  before  diff
    0      Doe J.    6.5    21.2  14.7 <- 21.2 because sum
    1  Jackson D.   12.2    14.2   2.0
    2     Pine C.   11.0    13.3   2.3
    3  Schafer L.    9.3     8.4  -0.9