Search code examples
python-3.xpandasunique

Unique values across columns row-wise in pandas with missing values


I have a dataframe like

import pandas as pd
import numpy as np

df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
                  "Col2": ['A', 'B', 'B', 'A', 'C'],
                  "Col3": ['A', 'B', 'C', 'A', 'C']})

I want to get the unique combinations across columns for each row and create a new column with those values, excluding the missing values.

The code I have right now to do this is

def handle_missing(s):
    
    return np.unique(s[s.notnull()])
    
def unique_across_rows(data):
    
    
    unique_vals = data.apply(handle_missing, axis = 1)
    
    # numpy unique sorts the values automatically
    merged_vals = unique_vals.apply(lambda x: x[0] if len(x) == 1 else '_'.join(x))
    
    return merged_vals

df['Combos'] = unique_across_rows(df)

This returns the expected output:

       Col1    Col2  Col3   Combos
  0       A       A     A       A
  1     NaN       B     B       B
  2       B       B     C       B_C
  3       B       A     A       A_B
  4       C       C     C       C

It seems to me that there should be a more vectorized approach that exists within Pandas to do this: how could I do that?


Solution

  • You can try a simple list comprehension which might be more efficient for larger dataframes:

    df['combos'] = ['_'.join(sorted(k for k in set(v) if pd.notnull(k))) for v in df.values]
    

    Or you can wrap the above list comprehension in a more readable function:

    def combos():
        for v in df.values:
            unique = set(filter(pd.notnull, v))
            yield '_'.join(sorted(unique))
    
    df['combos'] = list(combos())
    

      Col1 Col2 Col3 combos
    0    A    A    A      A
    1  NaN    B    B      B
    2    B    B    C    B_C
    3    B    A    A    A_B
    4    C    C    C      C