I am trying to calculate outliers in a dataset using Median Absolute Deviation. My dataset is 3 columns, 2000 rows in the following format:
Km Price id
139000 8500 2010 holden cruze cdx jg auto
173000 8500 2010 holden cruze cdx jg auto
95000 8800 2008 honda civic vti-l auto
141000 8800 2010 holden cruze cdx jg auto
169078 8880 1999 mazda mx-5 manual
How can I do array calculations like the following below. I want each row to have a 'median' value, which should be the median of all rows with the same id.
model = '2010 holden cruze cdx jg auto'
data[data['id']==model]['median'] = data[data['id']==model]['Price'].median()
The below statement provides the right median for each model, I just dont know how to quickly apply it to every row in the 'median' column.
median = data[data['id']==model]['Price'].median()
Or, alternatively, is my approach to Median Absolute Deviation wrong - is there a quicker/easier way to do it with pandas/numpy?
You can use transform
which returns a series the same length as the dataframe:
df['median'] = df.groupby('id').Price.transform('median')
>>> df
Km Price id median
0 139000 8500 2010-holden-cruze-cdx-jg-auto 8500
1 173000 8500 2010-holden-cruze-cdx-jg-auto 8500
2 95000 8800 2008-honda-civic-vti-l-auto 8800
3 141000 8800 2010-holden-cruze-cdx-jg-auto 8500
4 169078 8880 1999-mazda-mx-5-manual 8880