Search code examples
pythonpandasscikit-learnsklearn-pandas

Impute missing value by estimating the equation from non missing value in pandas and scikit-learn


I have a data frame as shown below

Price        Category      Area
20           Red           100
30           Red           150
10           Red           50
25           Red           NaN
30           Red           NaN
10           Green         30
20           Green         60
30           Green         90
10           Green         30
40           Green         NaN
50           Green         NaN

From the above, I would like to estimate the relation between Area and Price separately for each category Red and Green.

For example here for Red:

Area = 5 * Price

Similarly for Green:

Area = 3 * Price

I would like to implement that into pandas and scikit-learn:

Steps 1. Estimate the relation between Area and Price for each category. 2. Based on the relation impute the missing values.

I am very new in python and coding.

Expected output
Price        Category      Area
20           Red           100
30           Red           150
10           Red           50
25           Red           125
30           Red           150
10           Green         30
20           Green         60
30           Green         90
10           Green         30
40           Green         120
50           Green         150

Solution

  • You can also use a dictionary as below and then stack and fillna:

    d={'Area_Red' : 5 * df['Price'] , 'Area_Green' : 3 * df['Price']}
    df_ref=pd.DataFrame(d).rename(columns=lambda x: x.split('_')[1]).stack()
    df['Area']=df.set_index('Category',append=True)['Area'].fillna(df_ref).droplevel(1)
    

    Incase you want to just get the values as per df_ref and update the Area col (not just fillna), change the last line to below, use df.lookup:

    d={'Area_Red' : 5 * df['Price'] , 'Area_Green' : 3 * df['Price']}
    df_ref=pd.DataFrame(d).rename(columns=lambda x: x.split('_')[1]) #without stack
    df['Area']=df_ref.lookup(df['Category'].index,df['Category']) 
    

    print(df)
    
        Price Category  Area
    0      20      Red   100
    1      30      Red   150
    2      10      Red    50
    3      25      Red   125
    4      30      Red   150
    5      10    Green    30
    6      20    Green    60
    7      30    Green    90
    8      10    Green    30
    9      40    Green   120
    10     50    Green   150