I have a dataset of shape (4200, 8). The first row looks like this:
X0 X1 X2 X3 X4 X5 X6 X8
32 23 17 0 3 24 9 14
Each value is a categorical encoding that maps to a list of 30 values, which looks like this:
[ 0.06405287, -0.1176078 , -0.06206927, 0.08389127, -0.18036067,
0.35158703, -0.0928449 , -0.0974429 , -0.06705306, -0.17196381,
-0.03776502, 0.09204011, 0.47813812, 0.16258538, 0.2699648 ,
0.07496626, -0.09791522, -0.31499937, -0.24898018, 0.06126055,
0.13187763, 0.21042736, -0.1585868 , 0.08355565, -0.13935572,
0.12408883, 0.2043313 , -0.12544186, -0.09223691, 0.00720569 ]
My goal is to create a column for each value in this list, positioned by the list index, for each categorical encoding. The list above corresponds to the value 14 in column X8, so instead of:

X8 : 14

I get:

X8_1 X8_2 X8_3 ... X8_29 X8_30
0.06 -0.11 -0.06 ... -0.09 0.007

The end result is that my dataframe of 8 columns becomes a dataframe of 240 columns. Of course, each row has a different set of values. Here's how I've done this:
For each column I take its unique values and build a nested dictionary of the form colname -> uniqueval -> indexoflist -> listvalatindex. Then I turn each row of the dataframe into a dictionary, and for each column/value pair I look up the corresponding list and concatenate it along the columns. Finally I concatenate that row onto the previously built rows.
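For concreteness, here is the shape of that nested dictionary for the X8 = 14 example above (only the first few of the 30 entries are written out):

```python
# colname -> unique encoded value -> list index -> list value at that index
weights = {
    "X8": {
        14: {
            0: 0.06405287,
            1: -0.1176078,
            2: -0.06206927,
            # ... up to index 29
        },
    },
}

# looking up one embedding component for column X8, code 14
print(weights["X8"][14][2])
```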
weights = {}
for index, x in enumerate(encoded.columns):  # encoded is the dataset with the original encoded values
    weights[x] = {}
    for id, val in enumerate(encoded[x].unique()):
        weights[x][val] = {}
        for weightid, weightval in enumerate(model_full.get_layer(embeddings[index]).get_weights()[0][id]):  # the list of 30 values comes from here
            weights[x][val][weightid] = weightval

mappedembeddings = pd.DataFrame()
encodedindex = []
for row in encoded.iterrows():  # iterate over the original dataset
    encodedindex.append(row[0])  # store the index for later
    df0 = pd.DataFrame()
    for k, v in row[1].to_dict().items():  # for each key/value in the row
        names = []
        for z in weights[k][v].keys():
            names.append(str(k) + '_' + str(z))  # column naming (z is the key of the list value)
        tempdf = pd.DataFrame([weights[k][v]])  # one-row frame of the list at the column/value key
        tempdf.columns = names
        df0 = pd.concat([df0, tempdf], axis=1)
    mappedembeddings = pd.concat([mappedembeddings, df0], axis=0)  # concat this row to the previous rows
mappedembeddings.index = encodedindex
This takes a long time. I would like to vectorize this operation, but I'm unsure how to proceed, so I'd appreciate some insights.
You can vectorize this in a few steps:

- map each of your column values to the respective lists
- explode the lists to individual rows and stack
- groupby + cumcount to number the position within each list
- pivot to get the output

values = df.apply(lambda x: x.map(weights[x.name]))
values = values.explode(list(values.columns)).stack().reset_index()
values["column"] = values["level_1"] + "_" + (values.groupby(["level_0", "level_1"]).cumcount() + 1).astype(str)
output = values.pivot(index="level_0", columns="column", values=0)

After stack/reset_index, level_0 is the original row index and level_1 is the original column name, so cumcount within each (row, column) group numbers the list positions.
import pandas as pd
import numpy as np

np.random.seed(100)

# random dataframe with three columns X0, X1 and X2
df = pd.DataFrame(data=np.random.randint(30, size=(2, 3)),
                  columns=[f"X{i}" for i in range(3)])

# creating the weights dictionary
# weights[col][number]: list of 5 numbers
weights = dict()
for c in df.columns:
    weights[c] = {num: np.random.rand(5) for num in df[c].unique()}

values = df.apply(lambda x: x.map(weights[x.name]))
values = values.explode(list(values.columns)).stack().reset_index()
values["column"] = values["level_1"] + "_" + (values.groupby(["level_0", "level_1"]).cumcount() + 1).astype(str)
output = values.pivot(index="level_0", columns="column", values=0)
>>> output
column X0_1 X0_2 X0_3 ... X2_3 X2_4 X2_5
level_0 ...
0 0.844776 0.004719 0.121569 ... 0.372832 0.005689 0.252426
1 0.136707 0.575093 0.891322 ... 0.598843 0.603805 0.105148
[2 rows x 15 columns]
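As an alternative, if the per-column lookup tables can be stacked into 2-D arrays indexed directly by the code values (an assumption: it requires the codes to be contiguous integers starting at 0, which the dict form doesn't guarantee), plain NumPy fancy indexing plus hstack builds the same wide frame without explode/stack. A minimal sketch with made-up data (embedding width 5 instead of 30):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# small stand-in for the encoded dataset: codes in [0, 30)
df = pd.DataFrame(rng.integers(0, 30, size=(4, 3)),
                  columns=[f"X{i}" for i in range(3)])

dim = 5  # embedding width (30 in the question)
# one (n_categories, dim) matrix per column; row i holds the list for code i
mats = {c: rng.random((30, dim)) for c in df.columns}

blocks = []
names = []
for c in df.columns:
    blocks.append(mats[c][df[c].to_numpy()])  # fancy indexing -> (n_rows, dim)
    names.extend(f"{c}_{j + 1}" for j in range(dim))

output = pd.DataFrame(np.hstack(blocks), index=df.index, columns=names)
# output has df.shape[0] rows and len(df.columns) * dim columns
```

In your case the matrices are already available as `model_full.get_layer(...).get_weights()[0]`, provided the row order of each weight matrix matches the encoded values.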