I'm working on a user segmentation process ('RFM' segmentation), where users are categorized based on the 'buckets' that they fall into.
As a quick example, a user may be in a 'champions' or 'cannot lose' bucket based on their activity and purchases (their RFM score).
This is all calculated using an algorithm that is explained here: https://towardsdatascience.com/recency-frequency-monetary-model-with-python-and-how-sephora-uses-it-to-optimize-their-google-d6a0707c5f17
In the end, it is calculated as in this example:
    if RFM_Score >= 9:
        return "Cannot lose them"
    elif ((RFM_Score >= 8) and (RFM_Score < 9)):
        return "Winners"
Now, I want to offer the user the possibility to configure the boundaries (and names) of the buckets.
Is it possible to build a dynamic if-else structure that can be configured by parameters?
I thought about some kind of dictionary, like so:
    # The first value in the tuple is the lower bound, the second value is the upper bound.
    params = {'cannot lose': (9,), 'winners': (8, 9), [...] 'promising': (4, 5)}

    def find_class(value):
        # .items() yields (name, bounds) pairs; iterating the dict alone yields only the keys.
        for classname, boundaries in params.items():
            if value >= boundaries[0]:
                if len(boundaries) == 1:
                    return classname
                elif value < boundaries[1]:
                    return classname
However, I'm afraid that this will make the algorithm considerably more expensive (imagine that we're running this over potentially tens of millions of entries), while I think that a simple if/else chain will be fastest because of the way the Python interpreter is implemented.
I would like some insight on: (1) is the dict approach acceptable, and what are its possible downsides? (2) is it much slower?
You should take a look at pandas.cut, which divides values into buckets and labels them accordingly:
    import pandas as pd

    values = [8, 10, 6, 4, 4, 1]
    labels = pd.cut(values, bins=[0, 4, 5, 8, 10],
                    labels=["not so promising", "promising", "winners", "cannot lose them"])
I would assume that this is well optimised and will probably perform no worse, or at least not much worse, than a self-implemented version based on for loops and if/else statements.
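For comparison, here is a minimal sketch of what a self-implemented, configurable version might look like without pandas. It uses the standard library's bisect module instead of walking a dictionary, so each lookup is a binary search over the sorted lower bounds. The bucket names and bounds in BUCKETS, and the helper make_classifier, are made up for illustration and are not the full configuration from the question.

```python
from bisect import bisect_right

# Hypothetical configuration: (lower bound, label). Each lower bound is
# inclusive and a bucket extends up to the next higher lower bound.
BUCKETS = [(4, "promising"), (8, "winners"), (9, "cannot lose")]


def make_classifier(buckets):
    """Build a lookup function from a list of (lower_bound, label) pairs."""
    ordered = sorted(buckets)  # sort once, by lower bound
    lower_bounds = [lower for lower, _ in ordered]
    labels = [label for _, label in ordered]

    def classify(score):
        # bisect_right counts how many lower bounds are <= score.
        index = bisect_right(lower_bounds, score) - 1
        # Scores below the smallest lower bound match no bucket.
        return labels[index] if index >= 0 else None

    return classify


find_class = make_classifier(BUCKETS)
```

Calling find_class(8.5) returns "winners" here, and a score below 4 returns None. This still classifies one value per Python call, so over tens of millions of rows a vectorised approach is likely to be faster, though that would need to be measured.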
By default the bins are right-closed intervals, so in the pd.cut example above they are (0, 4], (4, 5], (5, 8], (8, 10]. That means that a point is assigned to the interval where it is greater than the left bound and not greater than the right bound. This behaviour can be changed with the arguments right or include_lowest (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html).
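To match the question's convention (lower bound inclusive, upper bound exclusive, top bucket open-ended), one option is to derive the bins and labels from a user-supplied configuration and pass right=False. The following is only a sketch: bucket_config, its values, and the helper build_bins are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical user configuration: label -> inclusive lower bound.
bucket_config = {"promising": 4, "winners": 8, "cannot lose": 9}


def build_bins(config):
    """Turn {label: lower_bound} into sorted bin edges and matching labels."""
    ordered = sorted(config.items(), key=lambda item: item[1])
    labels = [name for name, _ in ordered]
    # Append infinity so that the highest bucket has no upper limit.
    edges = [lower for _, lower in ordered] + [np.inf]
    return edges, labels


edges, labels = build_bins(bucket_config)
scores = pd.Series([9.5, 8.0, 8.9, 4.0, 3.0])

# right=False makes every interval left-closed: [4, 8), [8, 9), [9, inf).
segments = pd.cut(scores, bins=edges, labels=labels, right=False)
```

With these assumed bounds, a score of exactly 8.0 lands in "winners" rather than "promising", and scores below the lowest bound (3.0 here) come back as NaN, so a catch-all bucket would need its own entry in the configuration.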