Search code examples
pythonpandasdataframevectorization

How to vectorize the following, the append is probably the bottleneck


I have the following dataframe with two columns c1 and c2, I want to add a new column c3 based on the following logic, what I have works but is slow, can anyone suggest a way to vectorize this?

  • Must be grouped based on c1 and c2, then for each group, the new column c3 must be populated sequentially from values where the key is the value of c1 and each "sub group" will have subsequent values, IOW values[value_of_c1][idx], where idx is the "sub group", example below
  • The first group (1, 'a'), here c1 is 1, the "sub group" "a" index is 0 (first sub group of 1) so c3 for all rows in this group is values[1][0]
  • The second group (1, 'b') here c1 is still 1 but "sub group" is "b" so index 1 (second sub group of 1) so for all rows in this group c3 is values[1][1]
  • The third group (2, 'y') here c1 is now 2, "sub group" is "a" and the index is 0 (first sub group of 2), so for all rows in this group c3 is values[2][0]
  • And so on
  • values will have the necessary elements to satisfy this logic.

Code

import pandas as pd

df = pd.DataFrame(
    {
        "c1": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
        "c2": ["a", "a", "a", "b", "b", "b", "y", "y", "y", "z", "z", "z"],
    }
)
new_df = pd.DataFrame()
values = {1: ["a1", "a2"], 2: ["b1", "b2"]}
for i, j in df.groupby("c1"):
    for idx, (k, l) in enumerate(j.groupby("c2")):
        l["c3"] = values[i][idx]
        new_df = new_df.append(l)

Output (works but my code is slow)

    c1 c2  c3
0    1  a  a1
1    1  a  a1
2    1  a  a1
3    1  b  a2
4    1  b  a2
5    1  b  a2
6    2  y  b1
7    2  y  b1
8    2  y  b1
9    2  z  b2
10   2  z  b2
11   2  z  b2

Solution

  • If you don't mind using another library, you basically need to label encode within your groups:

    from sklearn.preprocessing import LabelEncoder
    
    def le(x):
        return pd.DataFrame(LabelEncoder().fit_transform(x),index=x.index)
        
    df['idx'] = df.groupby('c1')['c2'].apply(le)
    
    df['c3'] = df.apply(lambda x:values[x['c1']][x['idx']],axis=1)
    
        c1 c2  idx  c3
    0    1  a    0  a1
    1    1  a    0  a1
    2    1  a    0  a1
    3    1  b    1  a2
    4    1  b    1  a2
    5    1  b    1  a2
    6    2  y    0  b1
    7    2  y    0  b1
    8    2  y    0  b1
    9    2  z    1  b2
    10   2  z    1  b2
    11   2  z    1  b2
    

    Otherwise it's a matter of using pd.Categorical , same concept as above, just that you convert within each group, the column into a category and just pull out the code:

    def le(x):
        return pd.DataFrame(pd.Categorical(x).codes,index=x.index)