I want to merge columns from data frames in such a way:
list_A list_B
A, T, G G, C
B, K
C, L, AG L, AG, K
F, K
The output should be:
list_A list_B list_A&B
A, T, G G, C A, T, G, C
B, K B, K
C, L, AG L, AG, K C, L, AG, K
F, K F, K
I did:
df['list_A&B'] = df['list_A'].astype(list) + ', ' + df['list_B'].astype(list)
And, I am getting:
list_A list_B list_A&B
0 A, T, G G, C A, T, G, G, C
1 NaN B, K NaN
2 C, L, AG L, AG, K C, L, AG, L, AG, K
3 F, K NaN NaN
Here the union of the list is a problem when one of the list is empty. But, why?
I then tried unionizing as string but now the added nan cannot be dropped:
df['list_A&B'] = df['list_A'].astype(str) + ', ' + df['list_B'].astype(str)
which gives:
list_A list_B list_A&B
0 A, T, G G, C A, T, G, G, C
1 NaN B, K nan, B, K
2 C, L, AG L, AG, K C, L, AG, L, AG, K
3 F, K NaN F, K, nan
With this output, I am having difficulty removing 'nan' since they are reported as strings and dropna() and fillna() don't work with it.
Any suggestions ! - K
Looks like those are simply string values going by my interpretation of the data you've provided so far.
Steps:
Concatenate them row-wise using str.cat
with sep=','
and na_rep=''
which takes care of the NaN
values interspersed with the other string chars by treating it as an empty char.
Then, remove unwanted whitespaces present among them and also eliminate empty strings using filter(None,...)
after having taken the unique elements of the list using set
.
Finally, join the resulting list to make it a string representation of the list but excluding the brackets.
df['list_A&B'] = df['list_A'].str.cat(df['list_B'], ',','') \
.apply(lambda x: ', '.join(list(filter(None, set(re.sub(r"\s+", "", x) \
.split(','))))))
df # df.fillna('') to get back your desired output
Starting DF
used:
df = pd.DataFrame({'list_A': ['A, T, G', np.NaN, 'C, L, AG ', 'F, K'],
'list_B': ['G, C', 'B, K', 'L, AG, K', np.NaN]})
df