I am using the context and code of this question here: Aggregate all dataframe row pair combinations using pandas
import pandas
import itertools
mygenes=['ABC1', 'ABC2', 'ABC3', 'ABC4']
df = pandas.DataFrame({'Gene' : ['ABC1', 'ABC2', 'ABC3', 'ABC4','ABC5'],
'case1' : [0,1,1,0,0],
'case2' : [1,1,1,0,1],
'control1':[0,0,1,1,1],
'control2':[1,0,0,1,0] })
>>> df
Gene case1 case2 control1 control2
0 ABC1 0 1 0 1
1 ABC2 1 1 0 0
2 ABC3 1 1 1 0
3 ABC4 0 0 1 1
4 ABC5 0 1 1 0
The solution below sums up the rows, however, in my case, I would need the difference.
>>> df = df.set_index("Gene")
>>> cc = list(combinations(mygenes,2))
>>> out = pd.DataFrame([df.loc[c,:].sum() for c in cc], index=cc)
>>> out
case1 case2 control1 control2
(ABC1, ABC2) 1 2 0 1
(ABC1, ABC3) 1 2 1 1
(ABC1, ABC4) 0 1 1 2
(ABC2, ABC3) 2 2 1 0
(ABC2, ABC4) 1 1 1 1
(ABC3, ABC4) 1 1 2 1
So in my case, (ABC1, ABC2) would yield [-1,0,0,1]
Unfortunately, using .diff() instead of .sum() results in the following error message:
ValueError: Must pass 2-d input. shape=(10, 2, 8)
It may help to look at what .sum()
and .diff()
do
.sum()
produces a Series
>>> df.loc[cc[0], :].sum()
case1 1
case2 2
control1 0
control2 1
dtype: int64
.diff()
produces a DataFrame
>>> df.loc[cc[0], :].diff()
case1 case2 control1 control2
Gene
ABC1 NaN NaN NaN NaN
ABC2 1.0 0.0 0.0 -1.0
You can use the period of -1
with diff and then index the first row of the resulting DataFrame
>>> df.loc[cc[0], :].diff(-1).iloc[0]
case1 -1.0
case2 0.0
control1 0.0
control2 1.0
Name: ABC1, dtype: float64
>>> pd.DataFrame([df.loc[c,:].diff(-1).iloc[0] for c in cc], index=cc)
case1 case2 control1 control2
(ABC1, ABC2) -1.0 0.0 0.0 1.0
(ABC1, ABC3) -1.0 0.0 -1.0 1.0
(ABC1, ABC4) 0.0 1.0 -1.0 0.0
(ABC2, ABC3) 0.0 0.0 -1.0 0.0
(ABC2, ABC4) 1.0 1.0 -1.0 -1.0
(ABC3, ABC4) 1.0 1.0 0.0 -1.0