
Pandas: Pairwise Comparison Of Subjects To Eliminate Dominated Alternatives


I have a large dataframe of experimental results, which I need to triage to remove 'dominated' subjects across multiple criteria. The following 'toy' dataframe reflects the overall structure but not necessarily the dimensions of the 'experimental' dataframe.

import pandas as pd

df = pd.DataFrame({'Subject': ['Alpha', 'Bravo', 'Charlie'],
                   'A': [6, 7, 8],
                   'B': [11, 7, 12],
                   'C': [13, 6, 6],
                   'D': [5, 9, 4],
                   'E': [11, 9, 5],
                   'F': [9, 10, 3],
                   'G': [2, 6, 5],
                   'H': [8, 12, 11]})

   Subject  A   B   C  D   E   F  G   H
0    Alpha  6  11  13  5  11   9  2   8
1    Bravo  7   7   6  9   9  10  6  12
2  Charlie  8  12   6  4   5   3  5  11

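How do I generate the following results using a 'less than' pairwise comparison, where w, l and d count the criteria on which the first subject of a pair scores lower than, higher than, or equal to the second: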

[0, 1]: w=5, l=3, d=0
[0, 2]: w=4, l=4, d=0
[1, 2]: w=2, l=5, d=1

and combine them with the following pseudocode to create the subset of dominated subjects ['Bravo'] and remove it from the original dataframe?

tx = 3
i = 0

subject[0]='Alpha'
subject[1]='Bravo'

if w > l and l < tx
then y[i] = subject[0]
     z[i] = subject[1]
elseif w < l and w < tx
then y[i] = subject[1]
     z[i] = subject[0]

i += 1

Please advise.
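
To pin down the target, here is a minimal sketch that reproduces the three tallies and applies the pseudocode's rule pair by pair. The helper names `tally` and `decide` are illustrative, not pandas functions, and the sketch assumes every column from 'A' to 'H' is a numeric criterion where a lower value counts as a win.

```python
import itertools

import pandas as pd

df = pd.DataFrame({'Subject': ['Alpha', 'Bravo', 'Charlie'],
                   'A': [6, 7, 8], 'B': [11, 7, 12], 'C': [13, 6, 6],
                   'D': [5, 9, 4], 'E': [11, 9, 5], 'F': [9, 10, 3],
                   'G': [2, 6, 5], 'H': [8, 12, 11]})


def tally(first, second):
    """Count criteria where `first` is lower (w), higher (l) or equal (d)."""
    w = int((first < second).sum())
    l = int((first > second).sum())
    d = int((first == second).sum())
    return w, l, d


def decide(w, l, tx):
    """Return the position (0 or 1) of the dominated subject in the pair, or None."""
    if w > l and l < tx:
        return 1  # first subject wins, second is dominated
    if w < l and w < tx:
        return 0  # second subject wins, first is dominated
    return None   # neither subject dominates


scores = df.loc[:, 'A':'H']
tallies = {}
dominated = []
for pair in itertools.combinations(df.index, 2):
    i, j = pair
    w, l, d = tally(scores.loc[i], scores.loc[j])
    tallies[pair] = (w, l, d)
    print(f'[{i}, {j}]: w={w}, l={l}, d={d}')
    loser_pos = decide(w, l, tx=3)
    if loser_pos is not None:
        dominated.append(df.loc[pair[loser_pos], 'Subject'])

print(dominated)  # ['Bravo']
```

With tx = 3, only the pair [1, 2] satisfies either branch of the rule (w=2 is below both l=5 and tx), so 'Bravo' is the only subject flagged.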


Solution

  • The following code appears to work correctly:

    import itertools

    import numpy as np

    # df is the dataframe defined in the question.

    def pairwise_compare(dfq, pairs, tx):
        """Return parallel lists of winners and losers for every decided pair."""
        winners = []
        losers = []
        # Index by subject once so each pair needs only two row lookups.
        scores = dfq.set_index('Subject').loc[:, 'A':'H']
        for first, second in pairs:
            diffs = scores.loc[first].to_numpy() - scores.loc[second].to_numpy()
            w = np.sum(diffs < 0)  # criteria where the first subject is lower
            l = np.sum(diffs > 0)  # criteria where the second subject is lower
            if w > l and l < tx:
                winners.append(first)
                losers.append(second)
            elif w < l and w < tx:
                winners.append(second)
                losers.append(first)
        return winners, losers

    pairs = list(itertools.combinations(df['Subject'], 2))
    tx = 3
    winners, losers = pairwise_compare(df, pairs, tx)

    for winner, loser in zip(winners, losers):
        print(f'{loser} is dominated by {winner}')

    # Drop every dominated subject in one step, then index by subject.
    df = df[~df['Subject'].isin(losers)].set_index('Subject')
    print()
    print(df)
    
    

    and produces the required output.

    Bravo is dominated by Charlie
    
             A   B   C  D   E  F  G   H
    Subject                            
    Alpha    6  11  13  5  11  9  2   8
    Charlie  8  12   6  4   5  3  5  11
    

    I would appreciate it if one of the 'pandas' experts could produce a more idiomatic version!
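
One possible more idiomatic take is to let NumPy broadcasting compare every pair of rows at once instead of looping over pairs. This is only a sketch: the function name `drop_dominated` and its `key` parameter are made up for illustration, and it assumes every column other than 'Subject' is a numeric criterion where a lower value counts as a win.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Subject': ['Alpha', 'Bravo', 'Charlie'],
                   'A': [6, 7, 8], 'B': [11, 7, 12], 'C': [13, 6, 6],
                   'D': [5, 9, 4], 'E': [11, 9, 5], 'F': [9, 10, 3],
                   'G': [2, 6, 5], 'H': [8, 12, 11]})


def drop_dominated(frame, tx, key='Subject'):
    """Remove rows that some other row beats under the w/l/tx rule."""
    scores = frame.drop(columns=key).to_numpy()
    # wins[i, j] = number of criteria where row i is strictly lower than row j.
    wins = (scores[:, None, :] < scores[None, :, :]).sum(axis=2)
    # Row i's losses against row j are row j's wins against row i.
    losses = wins.T
    # Row i beats row j when it wins more criteria and loses fewer than tx.
    beats = (wins > losses) & (losses < tx)
    # Row j is dominated if at least one row beats it.
    dominated = beats.any(axis=0)
    return frame.loc[~dominated].set_index(key)


result = drop_dominated(df, tx=3)
print(result)
```

Evaluating the rule from both sides of each pair covers the `if` and `elif` branches of the loop version in a single expression. The intermediate comparison array has shape (subjects, subjects, criteria), so for a very large dataframe it may need to be processed in chunks.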