I have a Pandas DataFrame that contains a list of "solutions", describing which materials have to be added and removed to apply each solution. Unfortunately, some solutions involve exactly the same materials going in and out. I would therefore like to find out which solutions are effectively identical (apart from the value in the "Solution" column) and add a new column that makes this clear.
My dataset looks like this:
Solution | IN_amount | in_matnr | in_mat | OUT_amount | out_matnr | out_mat |
---|---|---|---|---|---|---|
3 | 1 | 80000000 | QQQ | | | |
3 | 1 | 22222222 | BBB | | | |
3 | 1 | 33333333 | CCC | | | |
3 | 2 | 44444444 | DDD | | | |
3 | 3 | 55555555 | EEE | | | |
3 | | | | 1 | 18000000 | ZZZ |
1 | 1 | 22222222 | BBB | | | |
1 | 1 | 11000000 | SSS | | | |
1 | 2 | 40000000 | MMM | | | |
1 | 2 | 12000000 | TTT | | | |
2 | 1 | 80000000 | QQQ | | | |
2 | 1 | 22222222 | BBB | | | |
2 | 1 | 33333333 | CCC | | | |
2 | 2 | 44444444 | DDD | | | |
2 | 3 | 55555555 | EEE | | | |
2 | | | | 1 | 18000000 | ZZZ |
[...]
Here, solutions 3 and 2 can be seen as identical (the same materials go in and out). I'd like to add a new column that marks the groups containing identical materials. Simplified:
Solution | ... | Solution_unique |
---|---|---|
3 | ... | 1 |
... | ... | ... |
1 | ... | 2 |
... | ... | ... |
2 | ... | 1 |
... | ... | ... |
This is what I have so far:
import pandas as pd
data = [{"Solution": 3, "IN_amount": 1, "in_matnr": 80000000, "in_mat": "QQQ"},
{"Solution": 3, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
{"Solution": 3, "IN_amount": 1, "in_matnr": 33333333, "in_mat": "CCC"},
{"Solution": 3, "IN_amount": 2, "in_matnr": 44444444, "in_mat": "DDD"},
{"Solution": 3, "IN_amount": 3, "in_matnr": 55555555, "in_mat": "EEE"},
{"Solution": 3, "OUT_amount": 1, "out_matnr": 18000000, "out_mat": "ZZZ"},
{"Solution": 1, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
{"Solution": 1, "IN_amount": 1, "in_matnr": 11000000, "in_mat": "SSS"},
{"Solution": 1, "IN_amount": 2, "in_matnr": 40000000, "in_mat": "MMM"},
{"Solution": 1, "IN_amount": 2, "in_matnr": 12000000, "in_mat": "TTT"},
{"Solution": 2, "IN_amount": 1, "in_matnr": 80000000, "in_mat": "QQQ"},
{"Solution": 2, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
{"Solution": 2, "IN_amount": 1, "in_matnr": 33333333, "in_mat": "CCC"},
{"Solution": 2, "IN_amount": 2, "in_matnr": 44444444, "in_mat": "DDD"},
{"Solution": 2, "IN_amount": 3, "in_matnr": 55555555, "in_mat": "EEE"},
{"Solution": 2, "OUT_amount": 1, "out_matnr": 18000000, "out_mat": "ZZZ"},
{"Solution": 5, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
{"Solution": 5, "IN_amount": 2, "in_matnr": 90000000, "in_mat": "RRR"},
{"Solution": 5, "OUT_amount": 1, "out_matnr": 17000000, "out_mat": "YYY"},
{"Solution": 4, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
{"Solution": 4, "IN_amount": 1, "in_matnr": 11000000, "in_mat": "SSS"},
{"Solution": 4, "IN_amount": 2, "in_matnr": 40000000, "in_mat": "MMM"},
{"Solution": 4, "IN_amount": 2, "in_matnr": 12000000, "in_mat": "TTT"},
{"Solution": 6, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
{"Solution": 6, "IN_amount": 1, "in_matnr": 13000000, "in_mat": "UUU"},
{"Solution": 6, "IN_amount": 2, "in_matnr": 40000000, "in_mat": "MMM"},
{"Solution": 6, "IN_amount": 2, "in_matnr": 12000000, "in_mat": "TTT"},
{"Solution": 7, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
{"Solution": 7, "IN_amount": 1, "in_matnr": 13000000, "in_mat": "UUU"},
{"Solution": 7, "IN_amount": 2, "in_matnr": 40000000, "in_mat": "MMM"},
{"Solution": 7, "IN_amount": 2, "in_matnr": 12000000, "in_mat": "TTT"}]
df = pd.DataFrame(data)
groups = df.groupby("Solution")
The DataFrame itself looks like this:
Solution IN_amount in_matnr in_mat OUT_amount out_matnr out_mat
0 3 1.0 80000000.0 QQQ NaN NaN NaN
1 3 1.0 22222222.0 BBB NaN NaN NaN
2 3 1.0 33333333.0 CCC NaN NaN NaN
3 3 2.0 44444444.0 DDD NaN NaN NaN
4 3 3.0 55555555.0 EEE NaN NaN NaN
5 3 NaN NaN NaN 1.0 18000000.0 ZZZ
6 1 1.0 22222222.0 BBB NaN NaN NaN
7 1 1.0 11000000.0 SSS NaN NaN NaN
8 1 2.0 40000000.0 MMM NaN NaN NaN
9 1 2.0 12000000.0 TTT NaN NaN NaN
[...]
Now I'm stuck on counting the unique groups and on adding some kind of marker for them in a new column (Solution_unique).
Thank you in advance, I'm looking forward to your assistance! Max
I would extract the per-Solution sub-DataFrames, compute a hash for each of them with pandas.util.hash_pandas_object and Python's built-in hash, and use this information to identify the identical groups:
# Sort the rows by everything except "Solution" so identical groups share the same
# row order, then hash each per-Solution sub-DataFrame
# (index=False ignores the original row labels, so only the values matter).
hashes = pd.Series({k: hash(tuple(pd.util.hash_pandas_object(g, index=False)))
                    for k, g in df.sort_values(by=list(df.columns.difference(['Solution'])))
                                  .drop(columns='Solution')
                                  .groupby(df['Solution'], sort=False)})

# Factorize the hashes (identical hashes -> identical small integers, starting at 1)
# and map them back onto the rows via "Solution".
df['Solution_unique'] = df['Solution'].map(dict(zip(hashes.index, pd.factorize(hashes)[0]+1)))
Output:
Solution IN_amount in_matnr in_mat OUT_amount out_matnr out_mat Solution_unique
0 3 1.0 80000000.0 QQQ NaN NaN NaN 1
1 3 1.0 22222222.0 BBB NaN NaN NaN 1
2 3 1.0 33333333.0 CCC NaN NaN NaN 1
3 3 2.0 44444444.0 DDD NaN NaN NaN 1
4 3 3.0 55555555.0 EEE NaN NaN NaN 1
5 3 NaN NaN NaN 1.0 18000000.0 ZZZ 1
6 1 1.0 22222222.0 BBB NaN NaN NaN 2
7 1 1.0 11000000.0 SSS NaN NaN NaN 2
8 1 2.0 40000000.0 MMM NaN NaN NaN 2
9 1 2.0 12000000.0 TTT NaN NaN NaN 2
10 2 1.0 80000000.0 QQQ NaN NaN NaN 1
11 2 1.0 22222222.0 BBB NaN NaN NaN 1
12 2 1.0 33333333.0 CCC NaN NaN NaN 1
13 2 2.0 44444444.0 DDD NaN NaN NaN 1
14 2 3.0 55555555.0 EEE NaN NaN NaN 1
15 2 NaN NaN NaN 1.0 18000000.0 ZZZ 1
16 5 1.0 22222222.0 BBB NaN NaN NaN 3
17 5 2.0 90000000.0 RRR NaN NaN NaN 3
18 5 NaN NaN NaN 1.0 17000000.0 YYY 3
19 4 1.0 22222222.0 BBB NaN NaN NaN 2
20 4 1.0 11000000.0 SSS NaN NaN NaN 2
21 4 2.0 40000000.0 MMM NaN NaN NaN 2
22 4 2.0 12000000.0 TTT NaN NaN NaN 2
23 6 1.0 22222222.0 BBB NaN NaN NaN 4
24 6 1.0 13000000.0 UUU NaN NaN NaN 4
25 6 2.0 40000000.0 MMM NaN NaN NaN 4
26 6 2.0 12000000.0 TTT NaN NaN NaN 4
27 7 1.0 22222222.0 BBB NaN NaN NaN 4
28 7 1.0 13000000.0 UUU NaN NaN NaN 4
29 7 2.0 40000000.0 MMM NaN NaN NaN 4
30 7 2.0 12000000.0 TTT NaN NaN NaN 4
Intermediates:
hashes
3 2408158917466943276
1 -9038486839634175931
2 2408158917466943276
5 2000192472388552548
4 -9038486839634175931
6 1944556807772715310
7 1944556807772715310
dtype: int64
dict(zip(hashes.index, pd.factorize(hashes)[0]+1))
{3: 1, 1: 2, 2: 1, 5: 3, 4: 2, 6: 4, 7: 4}
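For completeness, here is a hash-free variant of the same idea, as a minimal sketch: build an order-independent key per Solution (a sorted tuple of its stringified rows) and factorize those keys. The column name Solution_unique_alt is purely illustrative; it should reproduce the same grouping as Solution_unique, up to relabelling:
# Order-independent key per Solution: sorted, stringified rows
# (astype(str) turns NaN into the string "nan", so missing values compare consistently).
key_cols = df.columns.difference(['Solution', 'Solution_unique']).tolist()
keys = pd.Series({k: tuple(sorted(map(tuple, g[key_cols].astype(str).values.tolist())))
                  for k, g in df.groupby('Solution', sort=False)})
# Same factorize-and-map step as above; "Solution_unique_alt" is just an illustrative name.
df['Solution_unique_alt'] = df['Solution'].map(dict(zip(keys.index, pd.factorize(keys)[0] + 1)))
This builds Python tuples for every group, so on large frames the hash_pandas_object approach above will generally be faster.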