Search code examples
pythonpandasdataframecategorical-data

Expand a missing value in all categories of a column in Pandas


I have the following pd.DataFrame representing constraints of an optimisation problem.

          FEATURE     COLOR        CLASS       CONSTRAINTS
0         1.0         NaN          NaN         0.20
1         3.0         NaN          NaN         0.20
2         1.0         1.0          NaN         0.15
3         1.0         NaN            b        -0.05
4         1.0         1.0            a        -0.07
5         1.0         1.0            b        -0.10
6         3.0         1.0          NaN         0.10
7         NaN         NaN          NaN         0.20

Here FEATURE represents a categorical variable with possible values [1,2,3], COLOR represents a categorical variable with possible values [1,2] and CLASS is another categorical variable with possible values [a,b,c,d].

Missing values here have the meaning "all other values". In this sense the dataframe is a compressed version of a larger dataframe encompassing all or some of the combinations of the columns categories.

What I would like to do here is to "expand" the NaN values to all possible values each individual column can represent. For example row 0 would expand to 8 total rows, being the product of "free" features, namely COLOR with possible values [1,2] and CLASS with possible values `[a,b,c,d]``

new     FEATURE   COLOR    CLASS  CONSTRAINTS
0       1         1        a      0.2
1       1         1        b      0.2
2       1         2        a      0.2
3       1         2        b      0.2
4       1         1        a      0.2
5       1         1        b      0.2
6       1         2        a      0.2
7       1         2        b      0.2

How can I efficiently perform this transformation in Pandas?


Solution

  • Here is one way:

    mapping = {'FEATURE': [1, 2, 3], 'COLOR': [1, 2], 'CLASS': ['a', 'b', 'c', 'd']}
    cols = mapping.keys()
    
    for col in cols:
        df[col] = df[col].apply(lambda x: mapping[col] if pd.isna(x) else x)
        df = df.explode(col)
    

    For each column of interest, we fill the NaNs with the possible values (note that fillna doesn't fill with lists, hence the apply). Then we explode the so-filled column and repeat this for all three of the columns.

    We get the df at the end as:

      FEATURE COLOR CLASS  CONSTRAINTS
    0       1     1     a         0.20
    0       1     1     b         0.20
    0       1     1     c         0.20
    0       1     1     d         0.20
    0       1     2     a         0.20
    0       1     2     b         0.20
    0       1     2     c         0.20
    0       1     2     d         0.20
    1       3     1     a         0.20
    1       3     1     b         0.20
    1       3     1     c         0.20
    1       3     1     d         0.20
    1       3     2     a         0.20
    1       3     2     b         0.20
    1       3     2     c         0.20
    1       3     2     d         0.20
    2       1     1     a         0.15
    2       1     1     b         0.15
    2       1     1     c         0.15
    2       1     1     d         0.15
    3       1     1     b        -0.05
    3       1     2     b        -0.05
    4       1     1     a        -0.07
    5       1     1     b        -0.10
    6       3     1     a         0.10
    6       3     1     b         0.10
    6       3     1     c         0.10
    6       3     1     d         0.10
    7       1     1     a         0.20
    7       1     1     b         0.20
    7       1     1     c         0.20
    7       1     1     d         0.20
    7       1     2     a         0.20
    7       1     2     b         0.20
    7       1     2     c         0.20
    7       1     2     d         0.20
    7       2     1     a         0.20
    7       2     1     b         0.20
    7       2     1     c         0.20
    7       2     1     d         0.20
    7       2     2     a         0.20
    7       2     2     b         0.20
    7       2     2     c         0.20
    7       2     2     d         0.20
    7       3     1     a         0.20
    7       3     1     b         0.20
    7       3     1     c         0.20
    7       3     1     d         0.20
    7       3     2     a         0.20
    7       3     2     b         0.20
    7       3     2     c         0.20
    7       3     2     d         0.20
    

    which has 52 rows as expected.