Search code examples
pythonpandasdictionarymappingvectorization

How to vectorize this pandas operation?


I have a dataset of shape 4200,8. First row looks like this:

X0  X1  X2  X3  X4  X5  X6  X8
32  23  17  0   3   24  9   14

Each value is a categorical encoding that corresponds to a list of 30 values, that looks like this:

[ 0.06405287, -0.1176078 , -0.06206927,  0.08389127, -0.18036067,
  0.35158703, -0.0928449 , -0.0974429 , -0.06705306, -0.17196381,
 -0.03776502,  0.09204011,  0.47813812,  0.16258538,  0.2699648 ,
  0.07496626, -0.09791522, -0.31499937, -0.24898018,  0.06126055,
  0.13187763,  0.21042736, -0.1585868 ,  0.08355565, -0.13935572,
  0.12408883,  0.2043313 , -0.12544186, -0.09223691,  0.00720569 ]

My goal is to create a column for each value in this list, at the position of the list's corresponding categorical encoding. The list above corresponds to the value 14 at column X8, so instead of: X8 : 14 I have:

X8_1 X8_2 X8_3 ... X8_29 X8_30
0.06 -0.11 -0.62 ...-0.09 0.007

The end result being my dataframe of 8 columns becomes a dataframe of 240 columns. Of course, each row has a different set of values. Here's how I've done this: I get each unique value in the column, create a dictionary of colname:uniqueval:indexoflist:listvalatindex. Then I create a dictionary out of each row of the dataframe, and for each column and value in column I get the corresponding list and concatenate. Then I concatenate that row to the previous row.

weights = {}
for index, x in enumerate(encoded.columns): #this is the dataset with the original encoded values
  weights[x] = {}
  for id, val in enumerate(encoded[x].unique()):
    weights[x][val] = {}
    for weightid, weightval in enumerate(model_full.get_layer(embeddings[index]).get_weights()[0][id]): #this is where I get the list of 30 values from
      weights[x][val][weightid] = weightval

mappedembeddings = pd.DataFrame()
encodedindex = []
for row in encoded.iterrows(): #iterate over original dataset
  encodedindex.append(row[0]) #store index for later
  df0 = pd.DataFrame()
  for k, v in row[1].to_dict().items(): #for each key/val in row
    names = []
    for z in weights[k][v].keys():
      names.append(str(k)+'_'+str(z)) #naming (z is key of list value)

    tempdf = pd.DataFrame([weights[k][v]]) #dataframe of list at column/value key in dictionary made from embedding layer list
    tempdf.columns = names
    df0 = pd.concat([df0,tempdf],axis=1) 

  mappedembeddings = pd.concat([mappedembeddings,df0],axis=0) #concat row to previous row
  
mappedembeddings.index = encodedindex

This takes a long time. I would like to vectorize this operation, but I'm unsure how to proceed, so I'd appreciate some insights.


Solution

    1. map each of your column values to the respective lists
    2. explode the lists to individual rows and stack
    3. create the required columns names with groupby
    4. pivot to get the output
    values = df.apply(lambda x: x.map(weights[x.name]))
    values = values.explode(list(values.columns)).stack().reset_index()
    values["column"] = values["level_1"] + "_" + (values.groupby(["level_0", "level_1"]).transform("cumcount")+1).astype(str)
    output = values.pivot("level_0", "column", 0)
    
    Full working example:
    import pandas as pd
    import numpy as np
    
    np.random.seed(100)
    
    #random dataframe with three columns X0 X1 and X2
    df = pd.DataFrame(data = np.random.randint(30, size=(2,3)),
                      columns = [f"X{i}" for i in range(3)]
                      )
    
    #creating weights dictionary 
    #weights[col][number]: list of 5 numbers
    weights = dict()
    for c in df.columns:
        weights[c] = {num: np.random.rand(5) for num in df[c].unique()}
        
    values = df.apply(lambda x: x.map(weights[x.name]))
    values = values.explode(list(values.columns)).stack().reset_index()
    values["column"] = values["level_1"] + "_" + (values.groupby(["level_0", "level_1"]).transform("cumcount")+1).astype(str)
    output = values.pivot("level_0", "column", 0)
    
    >>> output
    column       X0_1      X0_2      X0_3  ...      X2_3      X2_4      X2_5
    level_0                                ...                              
    0        0.844776  0.004719  0.121569  ...  0.372832  0.005689  0.252426
    1        0.136707  0.575093  0.891322  ...  0.598843  0.603805  0.105148
    
    [2 rows x 15 columns]