Search code examples
pythonmergecol

Merge multiple columns into one, dropping duplicates across a row and removing nan values


I am trying to merge multiple columns into a single column while dropping duplicates and dropping null values but keeping the rows. What I have:

Index  |   A  |  B  | C
0      |  nan | nan | nan
1      |   1  |  1  | nan
2      |   x  |  2  | nan

What I need:

Index  |   New_col  
0      |   
1      |   1  
2      |   x,2  

What I have tried:

df= pd.DataFrame(np.array([['nan', 'nan', 'nan'], ['nan', 1, 1], ['nan', 'x', 2]]), columns=['A', 'B', 'C'])

df['a]=df[['a','b','c]].agg(', '.join, axis=1) 

Also tried: .stack().unstack() and .join but I cannot get these to remove duplicates by row.


Solution

  • Not sure if set preserves the order of incoming elements, but we can chain some operations together

    import pandas as pd
    import numpy as np
    from io import StringIO
    
    df = pd.read_csv(StringIO(
    """A|B|C
    ||
    1|1|
    x|2|"""), sep="|", dtype="object")
    
    >>> df.apply(lambda x: ",".join(set([str(i) for i in x if pd.notnull(i)])), axis=1)
    0
    1      1
    2    x,2
    dtype: object