Search code examples
pythonpandasdataframelogic

Pandas difference between columns of different categorical variable


I have a dataframe with 1000 columns that looks like this

Item Value1 Value2 Value3 Value4 Value..n
A      12     24     14    50     43
B      14     27     13    48     47
C      10     32     17    89     49
D      17     30     15    23     40

I want to calculate the difference between the values based on the Item variable, something like this

Difference Value1 Value2 Value3 Value4 Value..n
A-B          2      3      1      2      4
A-C          2     ...    ......    ...
A-D
B-C
B-D
C-D

and exclude the rows where if an item has more entries than the others. For example of item A has 100 entries and item B has 150, then the difference would be calculated of the first 100 entries and the next 50 entries of item B would be excluded from calculations.


Solution

  • We can do itertools.combinations, then reindex and create the dataframe

    import itertools
    idx = pd.MultiIndex.from_tuples([*itertools.combinations(df.Item,2)])
    df = df.set_index('Item')
    
    out = pd.DataFrame(df.reindex(idx.get_level_values(0)).values-df.reindex(idx.get_level_values(1)).values,
                       index=idx,
                       columns=df.columns)
    out.index=out.index.map('-'.join)
    out
         Value1  Value2  Value3  Value4  Value
    A-B      -2      -3       1       2     -4
    A-C       2      -8      -3     -39     -6
    A-D      -5      -6      -1      27      3
    B-C       4      -5      -4     -41     -2
    B-D      -3      -3      -2      25      7
    C-D      -7       2       2      66      9