I'm new to Python and I'm playing with a dataset of interest to help with my learning, in particular to get a better understanding of pandas and NumPy.
My DataFrame has over a million rows and I'm trying to create custom price buckets so I can find more interesting insights. My dataset looks like the following:
My DataTable:

Price   Postal_area  Purchase_Month
123000  SE22         2018_01
240000  GU22         2017_02
...
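
For anyone who wants to run the examples below, here is a tiny frame with the same shape (values taken from the sample rows above):

import pandas as pd

my_table = pd.DataFrame({
    'Price': [123000, 240000],
    'Postal_area': ['SE22', 'GU22'],
    'Purchase_Month': ['2018_01', '2017_02'],
})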
I want to group the data into price buckets of < 100k, 100k-200k, 200k-500k, and 500k+. I then want to group by price bucket, month, and postal area. I'm getting stumped at creating the custom price buckets.
What I've tried to do is create a custom function:
def price_range(price):
    # Map a single price to a named bucket
    if price <= 100000:
        return "Low Value"
    elif 100000 < price <= 200000:
        return "Medium Value"
    elif 200000 < price <= 500000:
        return "Medium High"
    elif price > 500000:
        return "High"
    else:
        return "Undefined"  # e.g. NaN prices fall through to here
And then I am creating a new column in my dataset as follows:
for val in my_table.Price:
    my_table["price_range"] = price_range(val)
I should be able to create an agg from this, but it's an extremely slow process: it has already been running for over 30 minutes on a million or so rows and is still going!
I have tried creating custom buckets with NumPy and pandas (pivot_table, groupby, lambdas) but haven't been able to figure out how to incorporate the custom bucket logic.
I looked at a few other answers, like the one below, but it didn't cover my particular needs: Efficient way to assign values from another column pandas df
Any help much appreciated!
Use the apply function to apply your custom function price_range to my_table:

my_table['price_range'] = my_table['Price'].apply(price_range)
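
On a million rows, apply still calls the Python function once per value, so if it remains slow, a vectorized alternative is numpy.select, which evaluates all the bucket conditions on the whole column at once. A minimal sketch, assuming the same boundaries as price_range above:

import numpy as np

# np.select picks the label of the first condition that is True, so order matters
conditions = [
    my_table['Price'] <= 100000,
    my_table['Price'] <= 200000,
    my_table['Price'] <= 500000,
    my_table['Price'] > 500000,
]
labels = ['Low Value', 'Medium Value', 'Medium High', 'High']
my_table['price_range'] = np.select(conditions, labels, default='Undefined')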
If you want four bins of equal width instead:

my_table['price_range'] = pd.cut(my_table['Price'], bins=4, labels=['Low Value', 'Medium Value', 'Medium High', 'High'])
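
pd.cut also accepts explicit bin edges, so you can reproduce the custom buckets without a Python-level function and then do the grouping you described in one go. A sketch, assuming the column names from the question:

import numpy as np
import pandas as pd

# Right-closed edges reproduce the custom buckets:
# (-inf, 100k], (100k, 200k], (200k, 500k], (500k, inf)
edges = [-np.inf, 100000, 200000, 500000, np.inf]
labels = ['Low Value', 'Medium Value', 'Medium High', 'High']
my_table['price_range'] = pd.cut(my_table['Price'], bins=edges, labels=labels)

# Group by bucket, month and postal area; size() counts rows per group.
# observed=True drops empty category combinations from the result.
summary = (my_table
           .groupby(['price_range', 'Purchase_Month', 'Postal_area'], observed=True)
           .size())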