Identifying unique Pandas DataFrame groups and adding a new column / index accordingly

I've a Pandas DataFrame which contains a list of "solutions" explaining which materials should be added and removed to apply those solutions. Unfortunately, some solutions hold the same materials which should go in and out. Therefore I'd like to figure out which solutions are basically identical (except for the value in column "Solution") and add a new column to make that clear.

My dataset looks like this:

Solution	IN_amount	in_matnr	in_mat	OUT_amount	out_matnr	out_mat
3	1	80000000	QQQ
3	1	22222222	BBB
3	1	33333333	CCC
3	2	44444444	DDD
3	3	55555555	EEE
3				1	18000000	ZZZ
1	1	22222222	BBB
1	1	11000000	SSS
1	2	40000000	MMM
1	2	12000000	TTT
2	1	80000000	QQQ
2	1	22222222	BBB
2	1	33333333	CCC
2	2	44444444	DDD
2	3	55555555	EEE
2				1	18000000	ZZZ

[...]

Here, the solutions 3 and 2 can be seen as identical solutions (the same materials are going in and out). I'd like to add a new column to mark those groups which contain the identical materials. Simplified:

Solution	...	Solution_unique
3	...	1
...	...	...
1	...	2
...	...	...
2	...	1
...	...	...

This is what I have so far:

import pandas as pd

data = [{"Solution": 3, "IN_amount": 1, "in_matnr": 80000000, "in_mat": "QQQ"},
        {"Solution": 3, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 3, "IN_amount": 1, "in_matnr": 33333333, "in_mat": "CCC"},
        {"Solution": 3, "IN_amount": 2, "in_matnr": 44444444, "in_mat": "DDD"},
        {"Solution": 3, "IN_amount": 3, "in_matnr": 55555555, "in_mat": "EEE"},
        {"Solution": 3, "OUT_amount": 1, "out_matnr": 18000000, "out_mat": "ZZZ"},
        {"Solution": 1, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 1, "IN_amount": 1, "in_matnr": 11000000, "in_mat": "SSS"},
        {"Solution": 1, "IN_amount": 2, "in_matnr": 40000000, "in_mat": "MMM"},
        {"Solution": 1, "IN_amount": 2, "in_matnr": 12000000, "in_mat": "TTT"},
        {"Solution": 2, "IN_amount": 1, "in_matnr": 80000000, "in_mat": "QQQ"},
        {"Solution": 2, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 2, "IN_amount": 1, "in_matnr": 33333333, "in_mat": "CCC"},
        {"Solution": 2, "IN_amount": 2, "in_matnr": 44444444, "in_mat": "DDD"},
        {"Solution": 2, "IN_amount": 3, "in_matnr": 55555555, "in_mat": "EEE"},
        {"Solution": 2, "OUT_amount": 1, "out_matnr": 18000000, "out_mat": "ZZZ"},
        {"Solution": 5, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 5, "IN_amount": 2, "in_matnr": 90000000, "in_mat": "RRR"},
        {"Solution": 5, "OUT_amount": 1, "out_matnr": 17000000, "out_mat": "YYY"},
        {"Solution": 4, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 4, "IN_amount": 1, "in_matnr": 11000000, "in_mat": "SSS"},
        {"Solution": 4, "IN_amount": 2, "in_matnr": 40000000, "in_mat": "MMM"},
        {"Solution": 4, "IN_amount": 2, "in_matnr": 12000000, "in_mat": "TTT"},
        {"Solution": 6, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 6, "IN_amount": 1, "in_matnr": 13000000, "in_mat": "UUU"},
        {"Solution": 6, "IN_amount": 2, "in_matnr": 40000000, "in_mat": "MMM"},
        {"Solution": 6, "IN_amount": 2, "in_matnr": 12000000, "in_mat": "TTT"},
        {"Solution": 7, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 7, "IN_amount": 1, "in_matnr": 13000000, "in_mat": "UUU"},
        {"Solution": 7, "IN_amount": 2, "in_matnr": 40000000, "in_mat": "MMM"},
        {"Solution": 7, "IN_amount": 2, "in_matnr": 12000000, "in_mat": "TTT"}]

df = pd.DataFrame(data)

df = df.groupby("Solution")

Which gives me something like

  Solution  IN_amount    in_matnr in_mat  OUT_amount   out_matnr out_mat
0          3        1.0  80000000.0    QQQ         NaN         NaN     NaN
1          3        1.0  22222222.0    BBB         NaN         NaN     NaN
2          3        1.0  33333333.0    CCC         NaN         NaN     NaN
3          3        2.0  44444444.0    DDD         NaN         NaN     NaN
4          3        3.0  55555555.0    EEE         NaN         NaN     NaN
5          3        NaN         NaN    NaN         1.0  18000000.0     ZZZ
6          1        1.0  22222222.0    BBB         NaN         NaN     NaN
7          1        1.0  11000000.0    SSS         NaN         NaN     NaN
8          1        2.0  40000000.0    MMM         NaN         NaN     NaN
9          1        2.0  12000000.0    TTT         NaN         NaN     NaN
[...]

Now I'm stuck at counting unique groups and adding a kind of marker inside a new column (Solution_unique).

Thank you in advance, I'm looking forward to your assistance! Max

Solution

I would extract the subdataframes, compute their hashes with pandas.util.hash_pandas_object/hash and use this information to identify the identical groups:

hashes = pd.Series({k: hash(tuple(pd.util.hash_pandas_object(g, index=False)))
                    for k, g in df.sort_values(by=list(df.columns.difference(['Solution'])))
                                  .drop(columns='Solution').groupby(df['Solution'], sort=False)}
                  )

df['Solution_unique'] = df['Solution'].map(dict(zip(hashes.index, pd.factorize(hashes)[0]+1)))

Output:

    Solution  IN_amount    in_matnr in_mat  OUT_amount   out_matnr out_mat  Solution_unique
0          3        1.0  80000000.0    QQQ         NaN         NaN     NaN                1
1          3        1.0  22222222.0    BBB         NaN         NaN     NaN                1
2          3        1.0  33333333.0    CCC         NaN         NaN     NaN                1
3          3        2.0  44444444.0    DDD         NaN         NaN     NaN                1
4          3        3.0  55555555.0    EEE         NaN         NaN     NaN                1
5          3        NaN         NaN    NaN         1.0  18000000.0     ZZZ                1
6          1        1.0  22222222.0    BBB         NaN         NaN     NaN                2
7          1        1.0  11000000.0    SSS         NaN         NaN     NaN                2
8          1        2.0  40000000.0    MMM         NaN         NaN     NaN                2
9          1        2.0  12000000.0    TTT         NaN         NaN     NaN                2
10         2        1.0  80000000.0    QQQ         NaN         NaN     NaN                1
11         2        1.0  22222222.0    BBB         NaN         NaN     NaN                1
12         2        1.0  33333333.0    CCC         NaN         NaN     NaN                1
13         2        2.0  44444444.0    DDD         NaN         NaN     NaN                1
14         2        3.0  55555555.0    EEE         NaN         NaN     NaN                1
15         2        NaN         NaN    NaN         1.0  18000000.0     ZZZ                1
16         5        1.0  22222222.0    BBB         NaN         NaN     NaN                3
17         5        2.0  90000000.0    RRR         NaN         NaN     NaN                3
18         5        NaN         NaN    NaN         1.0  17000000.0     YYY                3
19         4        1.0  22222222.0    BBB         NaN         NaN     NaN                2
20         4        1.0  11000000.0    SSS         NaN         NaN     NaN                2
21         4        2.0  40000000.0    MMM         NaN         NaN     NaN                2
22         4        2.0  12000000.0    TTT         NaN         NaN     NaN                2
23         6        1.0  22222222.0    BBB         NaN         NaN     NaN                4
24         6        1.0  13000000.0    UUU         NaN         NaN     NaN                4
25         6        2.0  40000000.0    MMM         NaN         NaN     NaN                4
26         6        2.0  12000000.0    TTT         NaN         NaN     NaN                4
27         7        1.0  22222222.0    BBB         NaN         NaN     NaN                4
28         7        1.0  13000000.0    UUU         NaN         NaN     NaN                4
29         7        2.0  40000000.0    MMM         NaN         NaN     NaN                4
30         7        2.0  12000000.0    TTT         NaN         NaN     NaN                4

Intermediates:

hashes

3    2408158917466943276
1   -9038486839634175931
2    2408158917466943276
5    2000192472388552548
4   -9038486839634175931
6    1944556807772715310
7    1944556807772715310
dtype: int64


dict(zip(hashes.index, pd.factorize(hashes)[0]+1))

{3: 1, 1: 2, 2: 1, 5: 3, 4: 2, 6: 4, 7: 4}