Tags: python, performance, pandas, string-concatenation

Python pandas: join dynamic columns, removing duplicates


We have a use case where we need to join all column values in a row, removing duplicates. The data is stored in a pandas DataFrame. For example, consider the DataFrame df below with columns A, B, C:

A   B   C   
X1  AX  X1
X2  X2  X1
X3  X3  X3
X4  XX  XX

I would like to add a new column that concatenates A, B, and C, removing any duplicates while preserving order. The output would look like:

A   B   C   Newcol

X1  AX  X1  X1_AX
X2  X2  X1  X2_X1
X3  X3  X3  X3
X4  XX  XX  X4_XX

Note that the number of columns is dynamic. Currently I am doing it with this command:

df.apply(lambda x: '-'.join(x.dropna().astype(str).drop_duplicates()), axis=1)

But this is very slow, taking around 150 seconds for my data. Since about 90% of my data usually has only 2 columns, I put an if statement in my code and run the command below for the 2-column case:

t1 = pd.Series(np.where(df.iloc[:, 0].dropna().astype(str) != df.iloc[:, 1].dropna().astype(str),
                        df.iloc[:, 0].dropna().astype(str) + "-" + df.iloc[:, 1].dropna().astype(str),
                        df.iloc[:, 1].dropna().astype(str)))

which takes around 55.3 milliseconds,

or even

t1 = df.iloc[:, 0].dropna().astype(str).where(df.iloc[:, 0].dropna().astype(str) == df.iloc[:, 1].dropna().astype(str),
                                              df.iloc[:, 0].dropna().astype(str) + "-" + df.iloc[:, 1].dropna().astype(str))

Both consume almost the same time (55 ms as opposed to 150 seconds), but the issue is that they only work for 2 columns. I would like a generalised statement that can handle n columns. I tried wrapping it in reduce, but it raised an error when I tried it with 3 columns:

reduce((lambda x, y: pd.Series(np.where(df.iloc[:, x].dropna().astype(str) != df.iloc[:, y].dropna().astype(str),
                                        df.iloc[:, x].dropna().astype(str) + "-" + df.iloc[:, y].dropna().astype(str),
                                        df.iloc[:, y].dropna().astype(str)))),
       list(range(df.shape[1])))

TypeError: '>=' not supported between instances of 'str' and 'int'

Please note that df is actually a chunk processed by a multicore parallel task, so it would be great if the suggestions exclude parallelism.


Solution

  • Try

    df['new'] = df.astype('str').apply(lambda x: '_'.join(set(x)), axis = 1)
    
        A   B   C   new
    0   X1  AX  X1  AX_X1
    1   X2  X2  X1  X1_X2
    2   X3  X3  X3  X3
    3   X4  XX  XX  X4_XX
    

    EDIT: set() does not preserve order; to maintain the order of the column values:

    def my_append(x):
        l = []
        for elm in x:
            if elm not in l:
                l.append(elm)
        return '_'.join(l)
    
    
    df['New col']=df.astype('str').apply(my_append, axis = 1)
    
    1000 loops, best of 3: 871 µs per loop
    

    Returns

        A   B   C   New col
    0   X1  AX  X1  X1_AX
    1   X2  X2  X1  X2_X1
    2   X3  X3  X3  X3
    3   X4  XX  XX  X4_XX
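
    On Python 3.7+, where plain dicts preserve insertion order, the loop in `my_append` can be collapsed into a one-liner with `dict.fromkeys` — a minimal sketch, not part of the original answer:

    ```python
    import pandas as pd

    def my_append(x):
        # dict.fromkeys drops duplicates while keeping first-seen order
        return '_'.join(dict.fromkeys(x))

    # sample frame from the question
    df = pd.DataFrame({'A': ['X1', 'X2', 'X3', 'X4'],
                       'B': ['AX', 'X2', 'X3', 'XX'],
                       'C': ['X1', 'X1', 'X3', 'XX']})
    df['New col'] = df.astype('str').apply(my_append, axis=1)
    ```

    This produces the same ordered, de-duplicated output as the explicit loop.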
    

    EDIT 1: In case you have NaN in any column, like this:

        A   B   C
    0   X1  AX  X1
    1   X2  X2  X1
    2   X3  X3  X3
    3   NaN XX  XX
    

    Handle it in the function and then apply:

    def my_append(x):
        l = []
        for elm in x:
            if elm not in l:
                l.append(elm)
        # after astype('str'), missing values appear as the string 'nan'
        l = [e for e in l if str(e) != 'nan']
        return '_'.join(l)
    
    df['New col']=df.astype('str').apply(my_append, axis = 1)
    
    
        A   B   C   New col
    0   X1  AX  X1  X1_AX
    1   X2  X2  X1  X2_X1
    2   X3  X3  X3  X3
    3   NaN XX  XX  XX
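
    As a side note on the `reduce` attempt in the question: it fails because `reduce` feeds the previous lambda result (a Series of joined strings) back in as `x`, which `df.iloc` then tries to use as a positional index, hence the `TypeError`. Reducing over the column Series themselves avoids that. A sketch under my own naming (`merge_pair` is hypothetical, NaN handling is omitted, and the per-row membership test is not fully vectorized):

    ```python
    from functools import reduce

    import pandas as pd

    def merge_pair(acc, col):
        """Append col's value to the accumulated string unless already present."""
        col = col.astype(str)
        # a value is a duplicate if it already appears among the '_'-joined parts
        dup = pd.Series([c in a.split('_') for a, c in zip(acc, col)],
                        index=acc.index)
        return acc.where(dup, acc + '_' + col)

    # sample frame from the question
    df = pd.DataFrame({'A': ['X1', 'X2', 'X3', 'X4'],
                       'B': ['AX', 'X2', 'X3', 'XX'],
                       'C': ['X1', 'X1', 'X3', 'XX']})

    # fold every remaining column into the first, one pair at a time
    df['New col'] = reduce(merge_pair,
                           (df[c] for c in df.columns[1:]),
                           df.iloc[:, 0].astype(str))
    ```

    This generalises to any number of columns because each `reduce` step only ever compares the running result against one more column.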