I have two pandas dataframes which look like this:
df1
node_id lat long
0 [INET_N_855] 53.017810 23.896413
1 [INET_N_1828] 52.984994 22.241386
2 [INET_N_329] 52.881484 20.619795
3 [INET_N_1612] 46.505528 13.592806
4 [INET_N_1009] 46.503733 13.416054
... ... ... ...
4670 [SEQ_12031_p] 49.697490 12.328040
4671 [NO_N_30] 59.272825 5.519794
4672 [INET_N_379] 35.828836 14.556524
4673 [INET_N_1287] 61.638170 21.398810
4674 [Prod_33] 64.982320 6.611590
[4675 rows x 3 columns]
df2
node_id ... long
0 [INET_N_855, INET_N_1828] ... [23.896413, 22.241386]
1 [INET_N_1828, INET_N_329] ... [22.241386, 20.619795]
2 [INET_N_1612, INET_N_1009] ... [13.592806, 13.416054]
3 [INET_N_1612, INET_N_1009] ... [13.592806, 13.416054]
4 [INET_N_1612, INET_N_1009] ... [13.592806, 13.416054]
... ... ... ...
6318 [SEQ_6435_p, INET_N_379] ... [13.88715, 14.556524]
6319 [N_14_M_LMGN, INET_N_1287] ... [23.08042, 21.39881]
6320 [SEQ_12356_p, Prod_33] ... [6.755214, 6.61159]
6321 [N_261_M_LMGN, SEQ_2566_p] ... [25.34835, 25.25854]
6322 [N_261_M_LMGN, SEQ_2566_p] ... [25.34835, 25.25854]
[6323 rows x 3 columns]
The 'node_id' column of df2 consists of items from the 'node_id' column of df1. Unfortunately, some of the items in 'node_id' are too long, so these list items need to be shortened to 12 characters or fewer before they can be fed into a simulation program.
To achieve this, I need a unique_identifier_generator(df1, df2) function that converts the entries in df1['node_id'] to unique ids of at most 12 characters, and does the same for df2['node_id'] using the matching ids.
I think I can handle replacing the elements in pandas myself. However, I do not know how to write the unique_identifier_generator function.
Do you know what to use, or which Python package I should check? Or is there a simple way to generate unique ids from a given string or a given pandas Series?
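This sounds like something a hash function could do. Python's hashlib module offers plenty of them, but you should pick one that is not known to be weak against collisions (for example SHA-256 rather than MD5). Keep in mind that cutting a digest down to 12 characters makes collisions possible again, so the shortened ids should still be checked for uniqueness.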
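A minimal sketch of the hashing idea, assuming the node names are plain strings. The helper name short_id and the choice of SHA-256 are illustrative assumptions, not something fixed by the question:

```python
import hashlib

def short_id(name: str, length: int = 12) -> str:
    """Return the first `length` hex characters of the SHA-256 digest of `name`."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[:length]

# A few node names taken from the frames in the question.
names = ["INET_N_855", "N_261_M_LMGN", "SEQ_12031_p"]
ids = {name: short_id(name) for name in names}

# A truncated digest is no longer guaranteed to be unique, so check explicitly.
if len(set(ids.values())) != len(ids):
    raise ValueError("collision between shortened ids")
```

The same name always produces the same id, so df1 and df2 can be converted independently and the ids will still match.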
Besides that, you could check out LabelEncoder from sklearn, which might be easier: it assigns each distinct value its own integer, so collisions cannot occur.
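Basic example for adding an id:
from sklearn.preprocessing import LabelEncoder
# Assumes df.node_id holds plain strings; list cells would first need flattening, e.g. with df.explode("node_id")
# Learn one integer label per distinct node name
encoder = LabelEncoder().fit(df.node_id)
# Store the integer label of each row in a new column
df["id"] = encoder.transform(df.node_id)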
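The id could be converted to a string or something similar, but an int may be more useful in some cases.
Conversion to str might look something like this:
# "i" avoids shadowing the built-in id(); the prefix must be short enough to keep the result within 12 characters
df["id"] = [f"node_{i}" for i in encoder.transform(df.node_id)]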
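If pulling in sklearn feels like too much, the same idea can be sketched with a plain dict that numbers every distinct name once and is then reused for both frames. The helper names build_id_mapping and shorten, the "n" prefix, and the long node name below are hypothetical. The lists stand in for df1['node_id'] and df2['node_id'], assuming each cell is a list of strings as shown in the question:

```python
from itertools import chain

def build_id_mapping(*columns, prefix="n"):
    """Map every distinct name found in the given columns to a short id.

    Each column is an iterable of lists of names, like df1["node_id"].
    """
    mapping = {}
    for names in chain.from_iterable(columns):
        for name in names:
            # setdefault keeps the id already assigned to a known name.
            mapping.setdefault(name, f"{prefix}{len(mapping)}")
    return mapping

def shorten(column, mapping):
    """Replace every name in every cell by its short id."""
    return [[mapping[name] for name in names] for names in column]

# Stand-ins for the two node_id columns (the long name is made up).
col1 = [["INET_N_855"], ["INET_N_1828"], ["SOME_VERY_LONG_NODE_NAME"]]
col2 = [["INET_N_855", "INET_N_1828"], ["INET_N_1828", "SOME_VERY_LONG_NODE_NAME"]]

mapping = build_id_mapping(col1, col2)
short1 = shorten(col1, mapping)
short2 = shorten(col2, mapping)
```

Because both columns go through the same mapping, the ids in df2 match those in df1. With pandas, the mapping could be applied via df1["node_id"].map(lambda names: [mapping[n] for n in names]), and likewise for df2.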