Search code examples
pythonpandashashmaplogicpandas-explode

How to explode pandas dataframe with lists to label the ones in the same row with same id?


For example, I have a pandas dataframe like this :

Original Dataframe

Ignoring the "Name" column, I want a dataframe that looks like this, labelling the Hashes of the same group with their "ID"

Processed Image

Here, we traverse each row, we encounter "8a43", and assign ID 1 to it, and wherever we find the same hash value, we assign ID as 1. Then we move on to the next row, and encounter 79e2 and b183. We then traverse all the rows and wherever we find these values, we store their IDs as 2. Now the issue will arise when we reach "abc7". It will be assigned ID=5 as it was previously encountered in "abc5". But I also want that in rows after the current one, wherever I find "26ea", assign the ID=5 to those as well.

I hope all this makes sense. If not, feel free to reach out to me via comments or message. I will clear it out quickly.


Solution

  • Solution using dict

    import numpy as np
    import pandas as pd
    
    hashvalues = list(df['Hash_Value'])
    
    dic, i = {}, 1
    id_list = []
    for hashlist in hashvalues:
        # convert to list
        if isinstance(hashlist, str):
            hashlist = hashlist.replace('[','').replace(']', '')
            hashlist = hashlist.split(',')
    
            # check if the hash is unknown
            if hashlist[0] not in dic:
                # Assign a new id
                dic[hashlist[0]] = i
                k = i
                i += 1
            else:
                # if known use existing id
                k = dic[hashlist[0]]
                
            for h in hashlist[1:]:
                # set id of the rest of the list hashes
                # equal to the first hashes's id
                dic[h] = k
                
            id_list.append(k)
        else:
            id_list.append(np.nan)
        
         print(df)
    
                   Hash   Name  ID
    0            [8a43]   abc1   1
    1       [79e2,b183]   abc2   2
    2            [f82a]   abc3   3
    3            [b183]   abc4   2
    4  [eaa7,5ea9,1cee]   abc5   4
    5            [5ea9]   abc6   4
    6       [1cee,26ea]   abc7   4
    7            [79e2]   abc8   2
    8            [8a43]   abc9   1
    9            [26ea]  abc10   4