I have a data frame as shown below
Price Category Area
20 Red 100
30 Red 150
10 Red 50
25 Red NaN
30 Red NaN
10 Green 30
20 Green 60
30 Green 90
10 Green 30
40 Green NaN
50 Green NaN
From the above, I would like to estimate the relation between Area and Price separately for each category Red and Green.
For example here for Red:
Area = 5 * Price
Similarly for Green:
Area = 3 * Price
I would like to implement that into pandas and scikit-learn:
Steps 1. Estimate the relation between Area and Price for each category. 2. Based on the relation impute the missing values.
I am very new in python and coding.
Expected output
Price Category Area
20 Red 100
30 Red 150
10 Red 50
25 Red 125
30 Red 150
10 Green 30
20 Green 60
30 Green 90
10 Green 30
40 Green 120
50 Green 150
You can also use a dictionary as below and then stack and fillna:
d={'Area_Red' : 5 * df['Price'] , 'Area_Green' : 3 * df['Price']}
df_ref=pd.DataFrame(d).rename(columns=lambda x: x.split('_')[1]).stack()
df['Area']=df.set_index('Category',append=True)['Area'].fillna(df_ref).droplevel(1)
Incase you want to just get the values as per df_ref
and update the Area
col (not just fillna), change the last line to below, use df.lookup
:
d={'Area_Red' : 5 * df['Price'] , 'Area_Green' : 3 * df['Price']}
df_ref=pd.DataFrame(d).rename(columns=lambda x: x.split('_')[1]) #without stack
df['Area']=df_ref.lookup(df['Category'].index,df['Category'])
print(df)
Price Category Area
0 20 Red 100
1 30 Red 150
2 10 Red 50
3 25 Red 125
4 30 Red 150
5 10 Green 30
6 20 Green 60
7 30 Green 90
8 10 Green 30
9 40 Green 120
10 50 Green 150