Converting a pandas df column of values to unique identifiers

I have two pandas dataframes which look like this:

df1
            node_id        lat       long
0      [INET_N_855]  53.017810  23.896413
1     [INET_N_1828]  52.984994  22.241386
2      [INET_N_329]  52.881484  20.619795
3     [INET_N_1612]  46.505528  13.592806
4     [INET_N_1009]  46.503733  13.416054
...             ...        ...        ...
4670  [SEQ_12031_p]  49.697490  12.328040
4671      [NO_N_30]  59.272825   5.519794
4672   [INET_N_379]  35.828836  14.556524
4673  [INET_N_1287]  61.638170  21.398810
4674      [Prod_33]  64.982320   6.611590
[4675 rows x 3 columns]

df2
                         node_id  ...                    long
0      [INET_N_855, INET_N_1828]  ...  [23.896413, 22.241386]
1      [INET_N_1828, INET_N_329]  ...  [22.241386, 20.619795]
2     [INET_N_1612, INET_N_1009]  ...  [13.592806, 13.416054]
3     [INET_N_1612, INET_N_1009]  ...  [13.592806, 13.416054]
4     [INET_N_1612, INET_N_1009]  ...  [13.592806, 13.416054]
...                          ...  ...                     ...
6318    [SEQ_6435_p, INET_N_379]  ...   [13.88715, 14.556524]
6319  [N_14_M_LMGN, INET_N_1287]  ...    [23.08042, 21.39881]
6320      [SEQ_12356_p, Prod_33]  ...     [6.755214, 6.61159]
6321  [N_261_M_LMGN, SEQ_2566_p]  ...    [25.34835, 25.25854]
6322  [N_261_M_LMGN, SEQ_2566_p]  ...    [25.34835, 25.25854]
[6323 rows x 3 columns]

The 'node_id' column of df2 consists of items from the 'node_id' column of df1. Unfortunately, some of the items in 'node_id' are too long, so these list items need to be shortened to 12 characters or fewer before they can be used as input to a simulation program.

To achieve this, I need a unique_identifier_generator(df1, df2) function that converts the entries in df1['node_id'] to unique ids of at most 12 characters and applies the same mapping to df2['node_id'], so that the ids match.

I think I can do the pandas element change part. However, I do not know how to create a unique_identifier_generator function.

Do you know what to use, or which Python package I should look at? Or is there a simple way to generate unique ids from a given string or a given pandas Series?

Solution

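This sounds like something a hash function could do. Python's hashlib module offers plenty of them, but you should pick one with good collision resistance (for example SHA-256 rather than MD5). Keep in mind that cutting a digest down to 12 characters makes collisions more likely, so check the shortened ids for duplicates.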
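A minimal sketch of the hash idea, assuming the node names are plain strings. The helper names short_hash_id and build_hash_map are made up for illustration, and SHA-256 truncated to 12 hex characters is just one reasonable choice. Since truncation weakens collision resistance, the sketch raises an error if two names end up with the same short id:

```python
import hashlib


def short_hash_id(name, length=12):
    """Return the first `length` hex characters of the SHA-256 digest of `name`."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[:length]


def build_hash_map(names, length=12):
    """Map each distinct name to a truncated hash, raising on any collision."""
    mapping = {}  # original name -> short id
    owner = {}    # short id -> original name, used to detect collisions
    for name in names:
        if name in mapping:
            continue  # repeated names reuse the id already assigned
        short = short_hash_id(name, length)
        if short in owner:
            raise ValueError(
                f"collision: {name!r} and {owner[short]!r} both map to {short}"
            )
        owner[short] = name
        mapping[name] = short
    return mapping


# Sample names taken from the question; the repeated entry is intentional.
node_names = ["INET_N_855", "N_261_M_LMGN", "SEQ_12031_p", "INET_N_855"]
hash_map = build_hash_map(node_names)
```

A hash has the advantage that the same name always gets the same id, even across separate runs or files, without having to store a lookup table.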

Besides that, you could check out the LabelEncoder from sklearn, which might be easier: it assigns every distinct value its own integer label, so collisions cannot occur.

A basic example of adding an id column:

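from sklearn.preprocessing import LabelEncoder

# fit on the node names, then map each name to an integer label (0..n-1)
# (expects a 1-D column of strings; unpack list cells first, e.g. df.node_id.str[0])
encoder = LabelEncoder().fit(df.node_id)
df["id"] = encoder.transform(df.node_id)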

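The integer id could be converted to a string if the simulation program needs one, but an int may be more useful in some cases.

Conversion to str might look something like this:

# "node_" plus the integer label stays well under 12 characters for a few thousand nodes
df["id"] = [f"node_{i}" for i in encoder.transform(df.node_id)]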
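To keep df1 and df2 consistent, the same name-to-id mapping has to be applied to both. Below is a rough sketch of that step without sklearn, assuming each 'node_id' cell is a list of strings as shown in the question and that every name in df2 also appears in df1. The function names build_id_map and relabel and the "n" prefix are hypothetical, and plain lists stand in for the dataframe columns:

```python
def build_id_map(cells, prefix="n", max_len=12):
    """Assign a sequential short id to every distinct name found in `cells`.

    `cells` is an iterable of lists of names, e.g. the values of df1['node_id'].
    """
    # Sort the distinct names so the numbering is reproducible between runs.
    names = sorted({name for cell in cells for name in cell})
    id_map = {name: f"{prefix}{i}" for i, name in enumerate(names)}
    longest = max((len(v) for v in id_map.values()), default=0)
    if longest > max_len:
        raise ValueError(f"generated ids exceed {max_len} characters")
    return id_map


def relabel(cells, id_map):
    """Replace every name in every cell with its short id.

    Raises KeyError if a name is missing from `id_map`.
    """
    return [[id_map[name] for name in cell] for cell in cells]


# Stand-ins for df1['node_id'] and df2['node_id'], using names from the question.
df1_cells = [["INET_N_855"], ["INET_N_1828"], ["INET_N_329"]]
df2_cells = [["INET_N_855", "INET_N_1828"], ["INET_N_1828", "INET_N_329"]]

id_map = build_id_map(df1_cells)              # built once, from df1 only
new_df1_cells = relabel(df1_cells, id_map)
new_df2_cells = relabel(df2_cells, id_map)    # same mapping, so ids match
```

With real dataframes, the same mapping could probably be applied with something like df2["node_id"].map(lambda cell: [id_map[n] for n in cell]). Keeping id_map around also makes it possible to translate the simulation output back to the original node names.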