Search code examples
pythonpandasmedian

Pandas dataframe apply calculations to selected rows


I am trying to calculate outliers in a dataset using Median Absolute Deviation. My dataset is 3 columns, 2000 rows in the following format:

Km      Price   id
139000  8500    2010 holden cruze cdx jg auto
173000  8500    2010 holden cruze cdx jg auto
95000   8800    2008 honda civic vti-l auto
141000  8800    2010 holden cruze cdx jg auto
169078  8880    1999 mazda mx-5 manual

How can I do array calculations like the following below. I want each row to have a 'median' value, which should be the median of all rows with the same id.

model = '2010 holden cruze cdx jg auto'
data[data['id']==model]['median'] = data[data['id']==model]['Price'].median()

The below statement provides the right median for each model, I just dont know how to quickly apply it to every row in the 'median' column.

median = data[data['id']==model]['Price'].median()

Or, alternatively, is my approach to Median Absolute Deviation wrong - is there a quicker/easier way to do it with pandas/numpy?


Solution

  • You can use transform which returns a series the same length as the dataframe:

    df['median'] = df.groupby('id').Price.transform('median')
    
    >>> df
           Km  Price                             id  median
    0  139000   8500  2010-holden-cruze-cdx-jg-auto    8500
    1  173000   8500  2010-holden-cruze-cdx-jg-auto    8500
    2   95000   8800    2008-honda-civic-vti-l-auto    8800
    3  141000   8800  2010-holden-cruze-cdx-jg-auto    8500
    4  169078   8880         1999-mazda-mx-5-manual    8880