python pandas dataframe hashlib

How do I create a hashing algorithm based on a combination of two variables in a DataFrame?


I'm new to Python and working on a hashing algorithm.

I have a DataFrame:

df2
Out[55]: 
         CID                 SID
0        2094825             141
1        2327668             583
2        2259956             155
3        1985370             100
4        2417177              47
         ...             ...
1030748  2262027             100
1030749  2232061             100
1030750  2027795             169
1030751  2474609             100
1030752  2335654             169

[1030753 rows x 2 columns]

How do I use Python's hashlib library to build a hash such that each combination of CID and SID gives me a unique code? For example, CID 2262027 with SID 100 might give fj6x55, and CID 2232061 with SID 100 a different code such as f6223xi. The codes only need to be unique per combination: if a combination repeats, its code should be the same. I'm also open to other suggestions, such as one-hot encoding, in case hashlib does not work. So far I am getting an error:

import hashlib
x = hashlib.md5(df2['SID'])
Traceback (most recent call last):

  File "<ipython-input-60-44772f235990>", line 1, in <module>
    x = hashlib.md5(df2['SubDiagnosisId'])

TypeError: object supporting the buffer API required

Solution

  • Here's my attempt at this one:

    hashes = df2.apply(lambda x: hashlib.md5((str(x['CID']) + str(x['SID'])).encode('utf8')).hexdigest(), axis=1)

    Some explanation:

    df2.apply takes a function, in this case an anonymous lambda function, as well as the axis over which we want to apply the function. In this case, axis=1 applies the function to each row.

    Breakdown of the hashing function:

    The anonymous function takes one argument x, which is a single row holding the two column values. We read them by label as x['CID'] and x['SID'], since positional access such as x[0] on a labelled row is deprecated in recent pandas versions.

    Here, we have two choices. We can either convert the integers to strings and concatenate them, as I've done here, or multiply the CID value by a constant that is strictly greater than max(SID) and then add SID. However, plain string concatenation is not guaranteed to be unique: without a separator, CID 12 with SID 3 and CID 1 with SID 23 both become "123". The better approach may be the following:

    df2.apply(lambda x: hashlib.md5(str(x['CID'] * 1024 + x['SID']).encode('utf8')).hexdigest(), axis=1)

    The largest SID in the sample you posted is 583, so I chose the next power of 2, 1024, as the multiplier. Check max(SID) over the full column before relying on this value. The multiplication effectively left-shifts every CID value by 10 bits, so that its 10 least significant bits are all zero. Adding SID then fills those bits.
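
    To make the difference concrete, here is a minimal sketch comparing the two key-building strategies. The helper names concat_key and packed_key and the pairs (12, 3) and (1, 23) are made up for illustration; only the 1024 multiplier comes from the answer above.

```python
def concat_key(cid: int, sid: int) -> str:
    # Plain concatenation, as in the first attempt: no separator between the IDs.
    return str(cid) + str(sid)


def packed_key(cid: int, sid: int, width_bits: int = 10) -> int:
    # Shift CID left by width_bits and place SID in the freed low bits.
    # Assumes 0 <= sid < 2**width_bits (1024 by default).
    if not 0 <= sid < (1 << width_bits):
        raise ValueError("sid does not fit in the reserved bits")
    return (cid << width_bits) + sid  # same as cid * 1024 + sid when width_bits=10


# Two different (CID, SID) pairs that collide under plain concatenation.
pair_a = (12, 3)
pair_b = (1, 23)
print(concat_key(*pair_a) == concat_key(*pair_b))  # True: both are "123"
print(packed_key(*pair_a) == packed_key(*pair_b))  # False: 12291 vs 1047

# The packed key is reversible, which is why distinct pairs cannot collide.
recovered = divmod(packed_key(2262027, 100), 1024)
print(recovered)  # (2262027, 100)
```

    Because the packed integer can be split back into its two parts, two different pairs can never produce the same integer, as long as every SID stays below the multiplier.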

    Either way, the input to the hash function must be a byte string, hence the str(integer_stuff).encode('utf8') part. This is also why the call in your question fails: hashlib.md5() accepts bytes, not a pandas Series. Finally, we pass the encoded bytes to hashlib.md5() and call .hexdigest() to retrieve the hexadecimal string representation of the hash.
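
    The conversion chain can be followed one step at a time. This is a small sketch using one row from the sample in the question; the intermediate variable names are only for illustration.

```python
import hashlib

cid, sid = 2262027, 100  # one (CID, SID) row from the question's sample

combined = cid * 1024 + sid          # integer key built from both columns
as_text = str(combined)              # int -> str
as_bytes = as_text.encode("utf8")    # str -> bytes, the type hashlib accepts
digest = hashlib.md5(as_bytes).hexdigest()  # hex string of 32 characters

print(type(as_bytes).__name__, len(digest))

# Passing anything that is not bytes-like raises TypeError,
# which is the kind of error shown in the question.
try:
    hashlib.md5(as_text)
except TypeError as exc:
    print("TypeError:", exc)
```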

    Improvements to my approach as far as Pandas itself is concerned are welcome :) but I think my hashing approach itself is quite sound.
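
    One possible pandas-side improvement, offered as a sketch rather than a benchmarked recommendation: apply with axis=1 builds a Series object for every row, while zipping the two columns avoids that. The helper row_hash, the "_" separator, and the four-row frame below are assumptions for illustration.

```python
import hashlib

import pandas as pd

# Small made-up frame with the same column names as in the question.
df2 = pd.DataFrame(
    {"CID": [2094825, 2327668, 2262027, 2262027], "SID": [141, 583, 100, 100]}
)


def row_hash(cid, sid):
    # The separator keeps pairs such as (12, 3) and (1, 23) apart.
    return hashlib.md5(f"{cid}_{sid}".encode("utf8")).hexdigest()


# zip walks both columns in parallel; no per-row Series is created.
df2["hash"] = [row_hash(c, s) for c, s in zip(df2["CID"], df2["SID"])]
print(df2)
```

    Assigning the list directly to df2["hash"] also removes the need for a separate join step.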

    EDIT:

    In order to join the result to the original DataFrame, try the following:

    import hashlib

    # Calculate the hashes. This gives you a Series.
    hashes = df2.apply(lambda x: hashlib.md5((str(x['CID']) + str(x['SID'])).encode('utf8')).hexdigest(), axis=1)
    # Create a one-column DataFrame named 'hash' from the above Series
    df_hash = hashes.to_frame(name='hash')
    # Join the hashes with the original DataFrame (aligned on the row index)
    df2 = df2.join(df_hash)
    

    Tested with a short set of data, so it should work for you too :)
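
    A quick sanity check along the same lines, as a sketch: count the distinct (CID, SID) pairs and the distinct hashes, and confirm the two numbers match. The three-row frame is made up from values in the question, and the hash column is built with the multiplier approach from above.

```python
import hashlib

import pandas as pd

# Three rows: the first and last are the same (CID, SID) combination.
df2 = pd.DataFrame({"CID": [2262027, 2232061, 2262027], "SID": [100, 100, 100]})
df2["hash"] = [
    hashlib.md5(str(c * 1024 + s).encode("utf8")).hexdigest()
    for c, s in zip(df2["CID"], df2["SID"])
]

n_pairs = len(df2[["CID", "SID"]].drop_duplicates())  # distinct combinations
n_hashes = df2["hash"].nunique()                      # distinct hash values
print(n_pairs, n_hashes)  # 2 2
```

    If the two counts ever differ on the full data, either the key construction is ambiguous or two different keys produced the same digest.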