Search code examples
pythonpandasdatemulti-indexfillna

Filling dates in dataframe with triple index


I know that a similiar question to this one has been made, but the solution works when you have only one categorical variable. I have two of those, and MultiIndexes have always been difficult for me to work with. The thing is, I've got the following dataframe:

 Date        Product    eCommerce   Sales
12-10-2018      A           1        10
12-12-2018      A           0        7
12-13-2018      A           1        4
12-15-2018      A           1        2
12-15-2018      A           0        2
12-11-2018      B           1        8
12-13-2018      B           1        6

So I need to know sales with and without eCommerce for every date and fill the dates that are not appearing with zero sales. My desired output would be:

 Date        Product    eCommerce   Sales
12-10-2018      A           1        10
12-11-2018      A           1        0
12-12-2018      A           1        0
12-13-2018      A           1        4
12-14-2018      A           1        0
12-15-2018      A           1        2

12-12-2018      A           0        7
12-13-2018      A           0        0
12-14-2018      A           0        0
12-15-2018      A           0        2

12-11-2018      B           1        8
12-12-2018      B           1        0
12-13-2018      B           1        6

Note: It's just one DataFrame, I just added the spaces to differentiate the indexes.

So in the original DataFrame 12-11 and 12-14 are missing for eCommerce Sales of Product A, 12-13 and 12-14 for non-eCommerce Sales of Product A and 12-12 is missing for eCommerce Sales of product B

The thing is, I managed to achieve what I want, through a very inefficient loop that takes about 25 minutes to run:

df_full= pd.DataFrame(columns=df.columns)
for sku in df['Product'].unique():

    aux=df.loc[df['Product']==sku]
    dates= pd.DataFrame(pd.date_range(start=aux.Date.min(), end=aux.Date.max(),freq='D'),columns=['Date'])

    df3 = df.loc[df['id_prod']==sku].merge(dates,
                                                   how='outer',left_on='Date',
                                                   right_on='Date').sort_values(by='Date')

    df3.fillna(method='ffill',inplace=True)
    df_full= df_full.append(df3)

I'm quite confident that I can do this in a vectorized way, that should take less time (I have 2,300 products for 290 possible dates). Do you know how can I achieve this?

Edit: Added bolded text with better explanation of the problem


Solution

  • Use a single DatetimeIndex then groupby + resample.asfreq(), (can use sum for numeric columns) as the date range is group dependent.

    import pandas as pd
    
    df['Date'] = pd.to_datetime(df.Date)
    df = df.set_index('Date')
    
    df.groupby(['Product', 'eCommerce'], sort=False).Sales.resample('D').sum().reset_index()
    

    If you have many columns you want to fill with zero then:

    (df.groupby(['Product', 'eCommerce'], sort=False)
        .resample('D').sum()
        .drop(columns=['Product', 'eCommerce'])
        .reset_index())
    

    Output:

       Product  eCommerce       Date  Sales
    0        A          1 2018-12-10     10
    1        A          1 2018-12-11      0
    2        A          1 2018-12-12      0
    3        A          1 2018-12-13      4
    4        A          1 2018-12-14      0
    5        A          1 2018-12-15      2
    6        A          0 2018-12-12      7
    7        A          0 2018-12-13      0
    8        A          0 2018-12-14      0
    9        A          0 2018-12-15      2
    10       B          1 2018-12-11      8
    11       B          1 2018-12-12      0
    12       B          1 2018-12-13      6