Search code examples
pythonpandasdataframeaggregatebins

Python custom aggregates - need a more efficient solution


I'm a new learner to Python and I'm playing with a dataset of interest to help with my learning, in particular trying to get a better understanding of pandas and numpy.

My dataframe has over a million rows and I'm trying to create a custom bucket so I can find more interesting insights. My dataset looks like the following:

My DataTable:

Price    Postal_area    Purchase_Month
123000   SE22           2018_01
240000   GU22           2017_02
.
.
.

I want to group the data into price buckets of < 100000, 200k - 300k, 300k - 500k, 500k+ I then want to group by the price buckets, month and postal area. I'm getting stumped at creating the custom price bucket.

What I've tried to do is create a custom function:

def price_range(Price):
    if (Price <= 100000):
        return ("Low Value")
    elif (100000 < Price < 200000):
        return ("Medium Value")
    elif (200001 < Price < 500000):
        return ("Medium High")
    elif (Price > 500001):
        return ("High")
    else:
        return ("Undefined")


And then I am creating a new column in my dataset as follows:

for val in (my_table.Price):
    my_table["price_range"] = (price_range(val))

I should be able to create an agg from this but its an extrememly slow process - already running at over 30 mins on a million or so rows and still running!

I have tried to play with creating custom buckets of data using numpy and pandas (pivot table, groupby, lambdas) but not been able to figure out how to incorporate the custom bucket logic.

I looked at a few other answers like the one below but it didn't cover my particular custom needs: Efficient way to assign values from another column pandas df

Any help much appreciated!


Solution

  • Use the apply function to apply your custom function price_range to my_table

    my_table['price_range']=my_table['Price'].apply(price_range)
    

    If you want bins with equal range:

    my_table['price_range']=pd.cut(my_table['Price'], bins = 4, labels = ['Low Value', 'Medium Value', 'Medium High', 'High'])