I have a dataframe
df = ID C1 C2 C3 C4
1 a b r q
1 b e g h
2 p a z p
1 r a n m
I want to get list of values in C1, C2 per ID. So I will have:
out = ID l
1 [a,b,r,e]
2 [p,a]
What is the best way to do so?
Use lambda function with numpy.unique
- output is sorted:
df1 = (df.groupby('ID')[['C1','C2']].apply(lambda x: np.unique(x).tolist())
.reset_index(name='l'))
print (df1)
ID l
0 1 [a, b, e, r]
1 2 [a, p]
Or with unique
- output is with original order:
df1 = (df.groupby('ID')[['C1','C2']].apply(lambda x: pd.unique(np.ravel(x)).tolist())
.reset_index(name='l'))
print (df1)
ID l
0 1 [a, b, e, r]
1 2 [p, a]
Another idea with remove missing values and duplicates:
df1 = (df.melt(id_vars='ID', value_vars=['C1','C2'], value_name='l')
.drop_duplicates(['ID','l'])
.dropna(subset=['l'])
.groupby('ID')['l']
.agg(list)
.reset_index())
print (df1)
ID l
0 1 [a, b, r, e]
1 2 [p, a]
EDIT: For separate columns use GroupBy.agg
:
df2 = df.groupby('ID')[['C1','C2']].agg(lambda x: np.unique(x).tolist()).reset_index()
print (df2)
ID C1 C2
0 1 [a, b, r] [a, b, e]
1 2 [p] [a]