Search code examples
pythonlistpandasarray-merge

How to merge columns containing list values when some column list values are emptly?


I want to merge columns from data frames in such a way:

list_A      list_B       
A, T, G     G, C        
            B, K         
C, L, AG    L, AG, K     
F, K                 

The output should be:

list_A      list_B       list_A&B
A, T, G     G, C         A, T, G, C
            B, K         B, K
C, L, AG    L, AG, K     C, L, AG, K
F, K                     F, K

I did:

df['list_A&B'] = df['list_A'].astype(list) + ', ' + df['list_B'].astype(list)

And, I am getting:

     list_A    list_B            list_A&B
0   A, T, G      G, C       A, T, G, G, C
1       NaN      B, K                 NaN
2  C, L, AG  L, AG, K  C, L, AG, L, AG, K
3      F, K       NaN                 NaN

Here the union of the list is a problem when one of the list is empty. But, why?

I then tried unionizing as string but now the added nan cannot be dropped: df['list_A&B'] = df['list_A'].astype(str) + ', ' + df['list_B'].astype(str)

which gives:

     list_A    list_B            list_A&B
0   A, T, G      G, C       A, T, G, G, C
1       NaN      B, K           nan, B, K
2  C, L, AG  L, AG, K  C, L, AG, L, AG, K
3      F, K       NaN           F, K, nan

With this output, I am having difficulty removing 'nan' since they are reported as strings and dropna() and fillna() don't work with it.

Any suggestions ! - K


Solution

  • Looks like those are simply string values going by my interpretation of the data you've provided so far.

    Steps:

    Concatenate them row-wise using str.catwith sep=',' and na_rep='' which takes care of the NaN values interspersed with the other string chars by treating it as an empty char.

    Then, remove unwanted whitespaces present among them and also eliminate empty strings using filter(None,...) after having taken the unique elements of the list using set.

    Finally, join the resulting list to make it a string representation of the list but excluding the brackets.

    df['list_A&B'] = df['list_A'].str.cat(df['list_B'], ',','') \
                      .apply(lambda x: ', '.join(list(filter(None, set(re.sub(r"\s+", "", x) \
                      .split(','))))))
    
    df                  # df.fillna('') to get back your desired output
    

    enter image description here


    Starting DF used:

    df = pd.DataFrame({'list_A': ['A, T, G', np.NaN, 'C, L, AG ', 'F, K'], 
                       'list_B': ['G, C', 'B, K', 'L, AG, K', np.NaN]})
    df
    

    enter image description here