python pandas dataframe categorical-data

Expand a missing value in all categories of a column in Pandas

I have the following pd.DataFrame representing constraints of an optimisation problem.

          FEATURE     COLOR        CLASS       CONSTRAINTS
0         1.0         NaN          NaN         0.20
1         3.0         NaN          NaN         0.20
2         1.0         1.0          NaN         0.15
3         1.0         NaN            b        -0.05
4         1.0         1.0            a        -0.07
5         1.0         1.0            b        -0.10
6         3.0         1.0          NaN         0.10
7         NaN         NaN          NaN         0.20

Here FEATURE represents a categorical variable with possible values [1,2,3], COLOR represents a categorical variable with possible values [1,2] and CLASS is another categorical variable with possible values [a,b,c,d].

Missing values here have the meaning "all other values". In this sense the dataframe is a compressed version of a larger dataframe encompassing all or some of the combinations of the columns categories.

What I would like to do here is to "expand" the NaN values to all possible values each individual column can represent. For example row 0 would expand to 8 total rows, being the product of "free" features, namely COLOR with possible values [1,2] and CLASS with possible values `[a,b,c,d]``

new     FEATURE   COLOR    CLASS  CONSTRAINTS
0       1         1        a      0.2
1       1         1        b      0.2
2       1         2        a      0.2
3       1         2        b      0.2
4       1         1        a      0.2
5       1         1        b      0.2
6       1         2        a      0.2
7       1         2        b      0.2

How can I efficiently perform this transformation in Pandas?

Solution

Here is one way:

mapping = {'FEATURE': [1, 2, 3], 'COLOR': [1, 2], 'CLASS': ['a', 'b', 'c', 'd']}
cols = mapping.keys()

for col in cols:
    df[col] = df[col].apply(lambda x: mapping[col] if pd.isna(x) else x)
    df = df.explode(col)

For each column of interest, we fill the NaNs with the possible values (note that fillna doesn't fill with lists, hence the apply). Then we explode the so-filled column and repeat this for all three of the columns.

We get the df at the end as:

  FEATURE COLOR CLASS  CONSTRAINTS
0       1     1     a         0.20
0       1     1     b         0.20
0       1     1     c         0.20
0       1     1     d         0.20
0       1     2     a         0.20
0       1     2     b         0.20
0       1     2     c         0.20
0       1     2     d         0.20
1       3     1     a         0.20
1       3     1     b         0.20
1       3     1     c         0.20
1       3     1     d         0.20
1       3     2     a         0.20
1       3     2     b         0.20
1       3     2     c         0.20
1       3     2     d         0.20
2       1     1     a         0.15
2       1     1     b         0.15
2       1     1     c         0.15
2       1     1     d         0.15
3       1     1     b        -0.05
3       1     2     b        -0.05
4       1     1     a        -0.07
5       1     1     b        -0.10
6       3     1     a         0.10
6       3     1     b         0.10
6       3     1     c         0.10
6       3     1     d         0.10
7       1     1     a         0.20
7       1     1     b         0.20
7       1     1     c         0.20
7       1     1     d         0.20
7       1     2     a         0.20
7       1     2     b         0.20
7       1     2     c         0.20
7       1     2     d         0.20
7       2     1     a         0.20
7       2     1     b         0.20
7       2     1     c         0.20
7       2     1     d         0.20
7       2     2     a         0.20
7       2     2     b         0.20
7       2     2     c         0.20
7       2     2     d         0.20
7       3     1     a         0.20
7       3     1     b         0.20
7       3     1     c         0.20
7       3     1     d         0.20
7       3     2     a         0.20
7       3     2     b         0.20
7       3     2     c         0.20
7       3     2     d         0.20

which has 52 rows as expected.