Search code examples
pythonpandasdataframeimputation

How to impute default values to Pandas dataframe columns?


Following is the df.head() of a DataSet with Date set as index.

               Article_ID Country_Code Sold_Units
     Date           
2017-01-01        3576        AT         1
2017-01-02        1579        FR         1
2017-01-02         332        FI         2
2017-01-03        3576        AT         1
2017-01-03         332        SE         1

The Country Code has 4 values 'AT, FR, FI, SE'. I want to append these 4 country_codes to every single date if they are not present on those dates and impute their values to 0 in Article_ID and Sold_Units.

The Example output should basically look like this:

               Article_ID Country_Code Sold_Units
     Date           
2017-01-01        3576        AT         1
2017-01-01          0         FR         0    # FR FI SE added with 0s.
2017-01-01          0         FI         0
2017-01-01          0         SE         0
2017-01-02          0         AT         0    # AT, SE added
2017-01-02        1579        FR         1
2017-01-02         332        FI         2
2017-01-02          0         SE         0
2017-01-03        3576        AT         1
2017-01-03          0         FR         0    # FR, FI added
2017-01-03          0         FI         0
2017-01-03         332        SE         1

How can I add such default values for every country code?


Solution

  • Solutions for unique DatetimeIndex with Country_Code combinations:

    Create MultiIndex by added Country_Code to DatetimeIndex with all combinations of unique values of datetimes with codes with DataFrame.reindex:

    df = (df.set_index('Country_Code', append=True)
            .reindex(pd.MultiIndex.from_product([df.index.unique(), df['Country_Code'].unique()], 
                                                names=['Date','Country_Code']), fill_value=0)
            .reset_index(level=1))
    print (df)
               Country_Code  Article_ID  Sold_Units
    Date                                           
    2017-01-01           AT        3576           1
    2017-01-01           FR           0           0
    2017-01-01           FI           0           0
    2017-01-01           SE           0           0
    2017-01-02           AT           0           0
    2017-01-02           FR        1579           1
    2017-01-02           FI         332           2
    2017-01-02           SE           0           0
    2017-01-03           AT        3576           1
    2017-01-03           FR           0           0
    2017-01-03           FI           0           0
    2017-01-03           SE         332           1
    

    Or use DataFrame.unstack with DataFrame.stack:

    df = (df.set_index('Country_Code', append=True)
            .unstack(fill_value=0)
            .stack()
            .reset_index(level=1))
    
    print (df)
               Country_Code  Article_ID  Sold_Units
    Date                                           
    2017-01-01           AT        3576           1
    2017-01-01           FI           0           0
    2017-01-01           FR           0           0
    2017-01-01           SE           0           0
    2017-01-02           AT           0           0
    2017-01-02           FI         332           2
    2017-01-02           FR        1579           1
    2017-01-02           SE           0           0
    2017-01-03           AT        3576           1
    2017-01-03           FI           0           0
    2017-01-03           FR           0           0
    2017-01-03           SE         332           1
    

    Solution for multiple values per datetimes with country codes:

    Error means data are like:

    print (df)
                Article_ID Country_Code  Sold_Units
    Date                                           
    2017-01-01        3576           AT           1
    2017-01-02        1579           FI           1 <-FI
    2017-01-02         332           FI           2 <-FI
    2017-01-03        3576           AT           1
    2017-01-03         332           SE           1
    

    df = (df.groupby(['Date','Country_Code'])
            .sum()
            .unstack(fill_value=0)
            .stack()
            .reset_index(level=1))
    print (df)
               Country_Code  Article_ID  Sold_Units
    Date                                           
    2017-01-01           AT        3576           1
    2017-01-01           FI           0           0
    2017-01-01           SE           0           0
    2017-01-02           AT           0           0
    2017-01-02           FI        1911           3
    2017-01-02           SE           0           0
    2017-01-03           AT        3576           1
    2017-01-03           FI           0           0
    2017-01-03           SE         332           1