Search code examples
pythonpandasdataframegroup-byunique

Identifying unique Pandas DataFrame groups and adding a new column / index accordingly


I've a Pandas DataFrame which contains a list of "solutions" explaining which materials should be added and removed to apply those solutions. Unfortunately, some solutions hold the same materials which should go in and out. Therefore I'd like to figure out which solutions are basically identical (except for the value in column "Solution") and add a new column to make that clear.

My dataset looks like this:

Solution IN_amount in_matnr in_mat OUT_amount out_matnr out_mat
3 1 80000000 QQQ
3 1 22222222 BBB
3 1 33333333 CCC
3 2 44444444 DDD
3 3 55555555 EEE
3 1 18000000 ZZZ
1 1 22222222 BBB
1 1 11000000 SSS
1 2 40000000 MMM
1 2 12000000 TTT
2 1 80000000 QQQ
2 1 22222222 BBB
2 1 33333333 CCC
2 2 44444444 DDD
2 3 55555555 EEE
2 1 18000000 ZZZ

[...]

Here, the solutions 3 and 2 can be seen as identical solutions (the same materials are going in and out). I'd like to add a new column to mark those groups which contain the identical materials. Simplified:

Solution ... Solution_unique
3 ... 1
... ... ...
1 ... 2
... ... ...
2 ... 1
... ... ...

This is what I have so far:

import pandas as pd

data = [{"Solution": 3, "IN_amount": 1, "in_matnr": 80000000, "in_mat": "QQQ"},
        {"Solution": 3, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 3, "IN_amount": 1, "in_matnr": 33333333, "in_mat": "CCC"},
        {"Solution": 3, "IN_amount": 2, "in_matnr": 44444444, "in_mat": "DDD"},
        {"Solution": 3, "IN_amount": 3, "in_matnr": 55555555, "in_mat": "EEE"},
        {"Solution": 3, "OUT_amount": 1, "out_matnr": 18000000, "out_mat": "ZZZ"},
        {"Solution": 1, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 1, "IN_amount": 1, "in_matnr": 11000000, "in_mat": "SSS"},
        {"Solution": 1, "IN_amount": 2, "in_matnr": 40000000, "in_mat": "MMM"},
        {"Solution": 1, "IN_amount": 2, "in_matnr": 12000000, "in_mat": "TTT"},
        {"Solution": 2, "IN_amount": 1, "in_matnr": 80000000, "in_mat": "QQQ"},
        {"Solution": 2, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 2, "IN_amount": 1, "in_matnr": 33333333, "in_mat": "CCC"},
        {"Solution": 2, "IN_amount": 2, "in_matnr": 44444444, "in_mat": "DDD"},
        {"Solution": 2, "IN_amount": 3, "in_matnr": 55555555, "in_mat": "EEE"},
        {"Solution": 2, "OUT_amount": 1, "out_matnr": 18000000, "out_mat": "ZZZ"},
        {"Solution": 5, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 5, "IN_amount": 2, "in_matnr": 90000000, "in_mat": "RRR"},
        {"Solution": 5, "OUT_amount": 1, "out_matnr": 17000000, "out_mat": "YYY"},
        {"Solution": 4, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 4, "IN_amount": 1, "in_matnr": 11000000, "in_mat": "SSS"},
        {"Solution": 4, "IN_amount": 2, "in_matnr": 40000000, "in_mat": "MMM"},
        {"Solution": 4, "IN_amount": 2, "in_matnr": 12000000, "in_mat": "TTT"},
        {"Solution": 6, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 6, "IN_amount": 1, "in_matnr": 13000000, "in_mat": "UUU"},
        {"Solution": 6, "IN_amount": 2, "in_matnr": 40000000, "in_mat": "MMM"},
        {"Solution": 6, "IN_amount": 2, "in_matnr": 12000000, "in_mat": "TTT"},
        {"Solution": 7, "IN_amount": 1, "in_matnr": 22222222, "in_mat": "BBB"},
        {"Solution": 7, "IN_amount": 1, "in_matnr": 13000000, "in_mat": "UUU"},
        {"Solution": 7, "IN_amount": 2, "in_matnr": 40000000, "in_mat": "MMM"},
        {"Solution": 7, "IN_amount": 2, "in_matnr": 12000000, "in_mat": "TTT"}]

df = pd.DataFrame(data)

df = df.groupby("Solution")

Which gives me something like

  Solution  IN_amount    in_matnr in_mat  OUT_amount   out_matnr out_mat
0          3        1.0  80000000.0    QQQ         NaN         NaN     NaN
1          3        1.0  22222222.0    BBB         NaN         NaN     NaN
2          3        1.0  33333333.0    CCC         NaN         NaN     NaN
3          3        2.0  44444444.0    DDD         NaN         NaN     NaN
4          3        3.0  55555555.0    EEE         NaN         NaN     NaN
5          3        NaN         NaN    NaN         1.0  18000000.0     ZZZ
6          1        1.0  22222222.0    BBB         NaN         NaN     NaN
7          1        1.0  11000000.0    SSS         NaN         NaN     NaN
8          1        2.0  40000000.0    MMM         NaN         NaN     NaN
9          1        2.0  12000000.0    TTT         NaN         NaN     NaN
[...]

Now I'm stuck at counting unique groups and adding a kind of marker inside a new column (Solution_unique).

Thank you in advance, I'm looking forward to your assistance! Max


Solution

  • I would extract the subdataframes, compute their hashes with pandas.util.hash_pandas_object/hash and use this information to identify the identical groups:

    hashes = pd.Series({k: hash(tuple(pd.util.hash_pandas_object(g, index=False)))
                        for k, g in df.sort_values(by=list(df.columns.difference(['Solution'])))
                                      .drop(columns='Solution').groupby(df['Solution'], sort=False)}
                      )
    
    df['Solution_unique'] = df['Solution'].map(dict(zip(hashes.index, pd.factorize(hashes)[0]+1)))
    

    Output:

        Solution  IN_amount    in_matnr in_mat  OUT_amount   out_matnr out_mat  Solution_unique
    0          3        1.0  80000000.0    QQQ         NaN         NaN     NaN                1
    1          3        1.0  22222222.0    BBB         NaN         NaN     NaN                1
    2          3        1.0  33333333.0    CCC         NaN         NaN     NaN                1
    3          3        2.0  44444444.0    DDD         NaN         NaN     NaN                1
    4          3        3.0  55555555.0    EEE         NaN         NaN     NaN                1
    5          3        NaN         NaN    NaN         1.0  18000000.0     ZZZ                1
    6          1        1.0  22222222.0    BBB         NaN         NaN     NaN                2
    7          1        1.0  11000000.0    SSS         NaN         NaN     NaN                2
    8          1        2.0  40000000.0    MMM         NaN         NaN     NaN                2
    9          1        2.0  12000000.0    TTT         NaN         NaN     NaN                2
    10         2        1.0  80000000.0    QQQ         NaN         NaN     NaN                1
    11         2        1.0  22222222.0    BBB         NaN         NaN     NaN                1
    12         2        1.0  33333333.0    CCC         NaN         NaN     NaN                1
    13         2        2.0  44444444.0    DDD         NaN         NaN     NaN                1
    14         2        3.0  55555555.0    EEE         NaN         NaN     NaN                1
    15         2        NaN         NaN    NaN         1.0  18000000.0     ZZZ                1
    16         5        1.0  22222222.0    BBB         NaN         NaN     NaN                3
    17         5        2.0  90000000.0    RRR         NaN         NaN     NaN                3
    18         5        NaN         NaN    NaN         1.0  17000000.0     YYY                3
    19         4        1.0  22222222.0    BBB         NaN         NaN     NaN                2
    20         4        1.0  11000000.0    SSS         NaN         NaN     NaN                2
    21         4        2.0  40000000.0    MMM         NaN         NaN     NaN                2
    22         4        2.0  12000000.0    TTT         NaN         NaN     NaN                2
    23         6        1.0  22222222.0    BBB         NaN         NaN     NaN                4
    24         6        1.0  13000000.0    UUU         NaN         NaN     NaN                4
    25         6        2.0  40000000.0    MMM         NaN         NaN     NaN                4
    26         6        2.0  12000000.0    TTT         NaN         NaN     NaN                4
    27         7        1.0  22222222.0    BBB         NaN         NaN     NaN                4
    28         7        1.0  13000000.0    UUU         NaN         NaN     NaN                4
    29         7        2.0  40000000.0    MMM         NaN         NaN     NaN                4
    30         7        2.0  12000000.0    TTT         NaN         NaN     NaN                4
    

    Intermediates:

    hashes
    
    3    2408158917466943276
    1   -9038486839634175931
    2    2408158917466943276
    5    2000192472388552548
    4   -9038486839634175931
    6    1944556807772715310
    7    1944556807772715310
    dtype: int64
    
    
    dict(zip(hashes.index, pd.factorize(hashes)[0]+1))
    
    {3: 1, 1: 2, 2: 1, 5: 3, 4: 2, 6: 4, 7: 4}