Search code examples
pythonpandasgroup-bydatatablecomparison

dataframe: do comparison of values within groups


INPUT DATA:

data = {'G1': ['a', 'a', 'a', 'a', 'a', 'b', 'b'], 'G2': ['b', 'b', 'c', 'c', 'd', 'c', 'c'], 'V1': [5, 15, 10, 20, 5, 10, 10], 'V2': [15, 5, 300, 10, 5, 10, 10]}

   | G1  G2  V1   V2
-- + --  --  --  ---
 0 | a   b    5   15
 1 | a   b   15    5
 2 | a   c   10  300
 3 | a   c   20   10
 4 | a   d    5    5
 5 | b   c   10   10
 6 | b   c   10   10

OUTPUT

   | G1  G2  V1   V2  E1  E2
-- + --  --  --  ---  --  --
 0 | a   b    5   15   1   1
 1 | a   b   15    5   1   1
 2 | a   c   10  300   1   0
 3 | a   c   20   10   0   1
 4 | a   d    5    5  NA  NA
 5 | b   c   10   10   1   1
 6 | b   c   10   10   1   1

DESCRIPTION:

How can I compute the columns E1 and E2 based on the following conditions:

groups with only one row should be ignored, and have None or NA in E1 and E2

groups with two rows:

  • if the value in the first row of V1 is equal to the value in the second row of V2, E1 should be 1 (or True) in the first row and E2 in the second.

  • if the value in the second row of V1 is equal to the value in the first row of V2, E1 should be 1 (or True) in the second row and E2 in the first.

  • If the values are not equal the respective positions should be 0 (or False).

If the description should not be clear enough, please ask.

Thank you very much.


Solution

  • Probably not the best performance wise, but here is one way using a custom function using np.diag:

    def func(arr):
        arr = arr.to_numpy()
        if len(arr)<2: return pd.DataFrame([[np.NaN, np.NaN]])
        x, y = np.diag(arr), np.diag(np.fliplr(arr))
        return pd.DataFrame([[np.all(x==x[0]), np.all(y==y[0])],
                             [np.all(y==y[0]), np.all(x==x[0])]])
    
    df[["E1", "E2"]] = df.groupby(["G1","G2"])[["V1","V2"]].apply(func).reset_index(drop=True)
    
    print (df)
    
      G1 G2  V1   V2   E1   E2
    0  a  b   5   15  1.0  1.0
    1  a  b  15    5  1.0  1.0
    2  a  c  10  300  1.0  0.0
    3  a  c  20   10  0.0  1.0
    4  a  d   5    5  NaN  NaN
    5  b  c  10   10  1.0  1.0
    6  b  c  10   10  1.0  1.0