Search code examples
pythonpandasdataframefinanceimputation

Inserting missing rows with imputed values in Python


Problem

How can you insert rows for missing YEARS, with imputed annual SALES.

Progress

The following code computes the sales differences. However, it is for one year, using the explicit iloc pointer technique.

import pandas as pd

data = {"YEAR": [1990, 1995, 2000, 1990, 1995, 2000], 
        "COUNTRY": ["USA", "USA", "USA", "USA", "USA", "USA"],
        "STATE": ["AZ", "AZ", "AZ", "AZ", "AZ", "AZ"],
        "BRANCH":["Bed", "Bed", "Bed", "Kitchen", "Kitchen", "Kitchen"], 
        "SALES": [50, 80, 100, 10, 20, 50]}

df = pd.DataFrame(data)

value_first = df.iloc[0][4]
value_second = df.iloc[1][4]

delta_step = (value_second - value_first) / 5 # because 5 years between

for x in range(0, 6): 
    print((x * delta_step) + value_first)

Original Data

Original dataset

Goal

The goal is to insert the yellow highlighted rows, where sales are imputed using straight-line steps between 1990 and 2000 sales figures. Destination dataset


Solution

  • First fine the missing year, of each group and make that into a merge df

    idx=df.groupby(['COUNTRY','STATE','BRANCH'])['YEAR'].\
           apply(lambda x : pd.Series(range(min(x),max(x)+1))).\
             reset_index(level=[0,1,2])
    

    Then do merge

    yourdf=idx.merge(df,how='left')
    

    Then using interpolate impute the missing value

    yourdf['SALES']=yourdf.groupby(['COUNTRY','STATE','BRANCH'])['SALES'].apply(pd.Series.interpolate)