Search code examples
pythonpandasdataframedata-cleaningmissing-data

Pandas how to interpolate or fillna missing price values for car makes?


I'm using this automobile dataset and I have doubt on which method to use to fill the missing price values. Is that even something that is worth worrying about since price is subjective?

Is it better to just drop these rows?

I was thinking I could calculate the mean price for each of the specific makes and fillna with it - would that give me a close enough interpolation of the missing values?

distinctNullMakes = automobile.loc[automobile.Price.isna() , 'Make'].unique()

for make in distinctNullMakes:
    mask = automobile.Make == make 
    mean = round(automobile.Price[mask].mean())
    automobile[mask] = automobile[mask].fillna(mean)

Also this piece of code doesn't look very elegant, is there a better way (obviously I'm a noob)?


Solution

  • Right there is a whole discussion to be had on replacing missing with a mean score or the median score (think net worth at a company and add in Jeff Bazos from amazon, it will move the mean but not the median).

    With that discussion done and deciding on the mean, how to do it in a neater pythonic way would be? This is done in a one-line function below that does five things in order.

    Groupby + Apply + Lambda + Fillna + Mean

    automobile['Price']=automobile.groupby('Make')['Price'].apply(lambda x:x.fillna(x.mean()))
    

    which is a bit shorter though achieves the same as above. The main thing it does is remove the loop for the make and does it implicitly allowing the code underlying pandas to vectorize the result.

    if you wanted to group by more than one item (like 'make' and 'body type') then change the groupby('Make') to groupby(['Make','BodyType')