python pandas dataframe categorical

Combine multiple categorical columns into one, when each row has only one non-NaN value, in Pandas


I have

import pandas as pd
import numpy as np
df = pd.DataFrame({"x": ["red", "blue", np.nan, np.nan, np.nan, np.nan, np.nan, ],
                   "y": [np.nan, np.nan, np.nan, 'cold', 'warm', np.nan, np.nan, ],
                   "z": [np.nan, np.nan,  np.nan, np.nan, np.nan, 'charm', 'strange'],
                  }).astype("category")

giving

     x     y        z
0   red   NaN      NaN
1  blue   NaN      NaN
2   NaN   NaN      NaN
3   NaN  cold      NaN
4   NaN  warm      NaN
5   NaN   NaN    charm
6   NaN   NaN  strange

I would like to add a new unordered categorical column with the values red, blue, cold, warm, charm and strange, filled in appropriately. I have many such columns, not just three.

Some possibilities:

  • using astype(str), concatenating the columns, and then re-creating a categorical
  • creating a new categorical type with union_categoricals, casting each column to that type, and then chaining fillna() calls

I can't make those or anything else work.

Note: using .astype(pd.CategoricalDtype(ordered=True)) instead of .astype("category") when defining df also works with the answer below.


Solution

  • New Solution

    For large datasets, the following solution may be more efficient:

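    def my_fun(x):
        # x is one row as a 1-D object array; keep only its non-null entries
        m = ~ pd.isnull(x)
        if m.any():
            # Length-1 array, because each row has at most one non-null value
            return x[m]
        else:
            # Same shape and dtype as above, so an all-NaN first row cannot break the output
            return np.array([np.nan], dtype=object)

    # apply_along_axis returns shape (n_rows, 1); flatten it and store it as a categorical
    df['new'] = pd.Categorical(np.apply_along_axis(my_fun, 1, df.to_numpy()).ravel())
    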
    
          x     y        z      new
    0   red   NaN      NaN      red
    1  blue   NaN      NaN     blue
    2   NaN   NaN      NaN      NaN
    3   NaN  cold      NaN     cold
    4   NaN  warm      NaN     warm
    5   NaN   NaN    charm    charm
    6   NaN   NaN  strange  strange
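
Because np.apply_along_axis still calls the Python function once per row, a fully vectorized variant may scale better on very large frames. The following is only a sketch under the question's assumption that each row holds at most one non-null value. The list cols, the helper variable names and the column name new_vectorized are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": ["red", "blue", np.nan, np.nan, np.nan, np.nan, np.nan],
                   "y": [np.nan, np.nan, np.nan, "cold", "warm", np.nan, np.nan],
                   "z": [np.nan, np.nan, np.nan, np.nan, np.nan, "charm", "strange"],
                   }).astype("category")

cols = ["x", "y", "z"]  # columns to combine (assumed to be known up front)

values = df[cols].to_numpy(dtype=object)  # 2-D object array of strings and NaN
mask = pd.notna(values)                   # True where a cell holds a value

# argmax gives the position of the first True in each row. For an all-NaN row
# it gives 0, and the cell at position 0 is NaN, so such rows stay NaN.
first = mask.argmax(axis=1)
picked = values[np.arange(len(df)), first]

df["new_vectorized"] = pd.Categorical(picked)
```

Here the categories are inferred from the picked values, so they come out sorted rather than in column order.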
    

    Edited answer

    As the OP specified, some rows may consist entirely of NaN values. In that case, we could try the following solution:

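    # Drop the all-NaN rows, then take each remaining row's first non-null value;
    # the dropped rows come back as NaN through index alignment on assignment
    df['new_col'] = df.dropna(how='all').apply(lambda x: x.loc[x.first_valid_index()], axis=1)
    df['new_col'] = pd.Categorical(df['new_col'])
    df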
    
          x     y        z  new_col
    0   red   NaN      NaN      red
    1  blue   NaN      NaN     blue
    2   NaN   NaN      NaN      NaN
    3   NaN  cold      NaN     cold
    4   NaN  warm      NaN     warm
    5   NaN   NaN    charm    charm
    6   NaN   NaN  strange  strange
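
The question's second idea, building one shared categorical type with union_categoricals, can also be made to work without any row-wise apply. The sketch below is one possible way to do it, not the answer's method. It assumes unordered categoricals and at most one non-null value per row; cols, shared, codes and the column name combined are illustrative names.

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals

df = pd.DataFrame({"x": ["red", "blue", np.nan, np.nan, np.nan, np.nan, np.nan],
                   "y": [np.nan, np.nan, np.nan, "cold", "warm", np.nan, np.nan],
                   "z": [np.nan, np.nan, np.nan, np.nan, np.nan, "charm", "strange"],
                   }).astype("category")

cols = ["x", "y", "z"]  # columns to combine (assumed to be known up front)

# One dtype whose categories are the union of all the columns' categories
shared = pd.CategoricalDtype(union_categoricals([df[c] for c in cols]).categories)

# After casting to the shared dtype, every column uses the same integer codes,
# with -1 standing for NaN. With at most one value per row, the row-wise
# maximum is that value's code, or -1 when the whole row is NaN.
codes = np.column_stack([df[c].astype(shared).cat.codes for c in cols]).max(axis=1)

df["combined"] = pd.Categorical.from_codes(codes, dtype=shared)
```

Working on the integer codes keeps the result categorical throughout and preserves the category order of the original columns.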