I have the following dataframe with two columns c1
and c2
, I want to add a new column c3
based on the following logic, what I have works but is slow, can anyone suggest a way to vectorize this?
c1
and c2
, then for each group, the new column c3
must be populated sequentially from values
where the key is the value of c1
and each "sub group" will have subsequent values, IOW values[value_of_c1][idx]
, where idx
is the "sub group", example below(1, 'a')
, here c1
is 1
, the "sub group" "a"
index is 0
(first sub group of 1) so c3
for all rows in this group is values[1][0]
(1, 'b')
here c1
is still 1
but "sub group" is "b"
so index 1
(second sub group of 1) so for all rows in this group c3
is values[1][1]
(2, 'y')
here c1
is now 2
, "sub group" is "a"
and the index is 0
(first sub group of 2), so for all rows in this group c3
is values[2][0]
values
will have the necessary elements to satisfy this logic.Code
import pandas as pd
df = pd.DataFrame(
{
"c1": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
"c2": ["a", "a", "a", "b", "b", "b", "y", "y", "y", "z", "z", "z"],
}
)
new_df = pd.DataFrame()
values = {1: ["a1", "a2"], 2: ["b1", "b2"]}
for i, j in df.groupby("c1"):
for idx, (k, l) in enumerate(j.groupby("c2")):
l["c3"] = values[i][idx]
new_df = new_df.append(l)
Output (works but my code is slow)
c1 c2 c3
0 1 a a1
1 1 a a1
2 1 a a1
3 1 b a2
4 1 b a2
5 1 b a2
6 2 y b1
7 2 y b1
8 2 y b1
9 2 z b2
10 2 z b2
11 2 z b2
If you don't mind using another library, you basically need to label encode within your groups:
from sklearn.preprocessing import LabelEncoder
def le(x):
return pd.DataFrame(LabelEncoder().fit_transform(x),index=x.index)
df['idx'] = df.groupby('c1')['c2'].apply(le)
df['c3'] = df.apply(lambda x:values[x['c1']][x['idx']],axis=1)
c1 c2 idx c3
0 1 a 0 a1
1 1 a 0 a1
2 1 a 0 a1
3 1 b 1 a2
4 1 b 1 a2
5 1 b 1 a2
6 2 y 0 b1
7 2 y 0 b1
8 2 y 0 b1
9 2 z 1 b2
10 2 z 1 b2
11 2 z 1 b2
Otherwise it's a matter of using pd.Categorical
, same concept as above, just that you convert within each group, the column into a category and just pull out the code:
def le(x):
return pd.DataFrame(pd.Categorical(x).codes,index=x.index)