Search code examples
pythonpandascombinationspython-itertools

Aggregate row pair combinations and calculate difference using pandas


I am using the context and code of this question here: Aggregate all dataframe row pair combinations using pandas

import pandas
import itertools

mygenes=['ABC1', 'ABC2', 'ABC3', 'ABC4']

df = pandas.DataFrame({'Gene' : ['ABC1', 'ABC2', 'ABC3', 'ABC4','ABC5'],
                       'case1'   : [0,1,1,0,0],
                       'case2'   : [1,1,1,0,1],
                       'control1':[0,0,1,1,1],
                       'control2':[1,0,0,1,0] })
>>> df
   Gene  case1  case2  control1  control2
0  ABC1      0      1         0         1
1  ABC2      1      1         0         0
2  ABC3      1      1         1         0
3  ABC4      0      0         1         1
4  ABC5      0      1         1         0

The solution below sums up the rows, however, in my case, I would need the difference.

>>> df = df.set_index("Gene")
>>> cc = list(combinations(mygenes,2))
>>> out = pd.DataFrame([df.loc[c,:].sum() for c in cc], index=cc)
>>> out
              case1  case2  control1  control2
(ABC1, ABC2)      1      2         0         1
(ABC1, ABC3)      1      2         1         1
(ABC1, ABC4)      0      1         1         2
(ABC2, ABC3)      2      2         1         0
(ABC2, ABC4)      1      1         1         1
(ABC3, ABC4)      1      1         2         1

So in my case, (ABC1, ABC2) would yield [-1,0,0,1]

Unfortunately, using .diff() instead of .sum() results in the following error message:

ValueError: Must pass 2-d input. shape=(10, 2, 8)


Solution

  • It may help to look at what .sum() and .diff() do

    .sum() produces a Series

    >>> df.loc[cc[0], :].sum()
    case1       1
    case2       2
    control1    0
    control2    1
    dtype: int64
    

    .diff() produces a DataFrame

    >>> df.loc[cc[0], :].diff()
          case1  case2  control1  control2
    Gene                                  
    ABC1    NaN    NaN       NaN       NaN
    ABC2    1.0    0.0       0.0      -1.0
    

    You can use the period of -1 with diff and then index the first row of the resulting DataFrame

    >>> df.loc[cc[0], :].diff(-1).iloc[0]
    case1      -1.0
    case2       0.0
    control1    0.0
    control2    1.0
    Name: ABC1, dtype: float64
    
    >>> pd.DataFrame([df.loc[c,:].diff(-1).iloc[0] for c in cc], index=cc)
                  case1  case2  control1  control2
    (ABC1, ABC2)   -1.0    0.0       0.0       1.0
    (ABC1, ABC3)   -1.0    0.0      -1.0       1.0
    (ABC1, ABC4)    0.0    1.0      -1.0       0.0
    (ABC2, ABC3)    0.0    0.0      -1.0       0.0
    (ABC2, ABC4)    1.0    1.0      -1.0      -1.0
    (ABC3, ABC4)    1.0    1.0       0.0      -1.0