I know pandas and numpy have binning functionality, such as pd.cut and np.digitize. But these are mainly useful when working with large arrays / lists / dataframes. For my purposes it seems overkill to use these methods, since it's just a single variable in my project.
Right now I have a single continuous variable and I use the following function to bin it (make it discrete):
def bin_data(self):  # noqa: C901
    if self.value <= 300000:
        bin_cat = 999
    elif 300000 < self.value <= 500000:
        bin_cat = 15000
    elif 500000 < self.value <= 1000000:
        bin_cat = 30000
    elif 1000000 < self.value <= 2200000:
        bin_cat = 60000
    elif 2200000 < self.value <= 4400000:
        bin_cat = 120000
    elif 4400000 < self.value <= 8800000:
        bin_cat = 180000
    elif 8800000 < self.value <= 17500000:
        bin_cat = 300000
    elif 17500000 < self.value <= 35000000:
        bin_cat = 600000
    elif 35000000 < self.value <= 70000000:
        bin_cat = 900000
    elif 70000000 < self.value <= 140000000:
        bin_cat = 1500000
    else:
        bin_cat = 3000000
But this results in a flake8 error C901: function too complex.
Two questions:
Not saying this code is complex, but you may be able to write this in a less error-prone, less complex way with a loop over some (end, result) tuples. Note also that flake8 does not detect complexity by default; you have to opt into that behaviour by setting a threshold.
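For reference, the opt-in is usually done through flake8's max-complexity option, either on the command line (flake8 --max-complexity 10) or in a config file. A minimal sketch, assuming a setup.cfg or tox.ini at the project root; the threshold of 10 is only an illustrative value, not one taken from the question:

```ini
[flake8]
# C901 is only reported once a threshold is set; 10 is an example value
max-complexity = 10
```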
Since it didn't fit in the comments section, here's an alternative way to write this while satisfying the complexity check:
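# (inclusive upper bound, bucket) pairs, in ascending order of bound
BINS = (
    (300000, 999),
    (500000, 15000),
    (1000000, 30000),
    (2200000, 60000),
    (4400000, 120000),
    (8800000, 180000),
    (17500000, 300000),
    (35000000, 600000),
    (70000000, 900000),
    (140000000, 1500000),
)
# bucket for values above the last bound
OTHERWISE = 3000000


def bin_data(value):
    # return the bucket of the first bound that the value does not exceed
    for max_n, bucket in BINS:
        if value <= max_n:
            return bucket
    else:
        return OTHERWISE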
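If the loop itself feels like too much, the same lookup can probably be done with the standard library's bisect module. This is only a sketch, not part of the original answer: the names EDGES, BUCKETS and bin_value are made up for illustration, and the bounds and buckets are copied from the function in the question.

```python
from bisect import bisect_left

# Inclusive upper bounds of each bin, in ascending order
# (values copied from the question's if/elif chain).
EDGES = (
    300000, 500000, 1000000, 2200000, 4400000,
    8800000, 17500000, 35000000, 70000000, 140000000,
)

# One bucket per edge, plus a final bucket for values above the last edge.
BUCKETS = (
    999, 15000, 30000, 60000, 120000,
    180000, 300000, 600000, 900000, 1500000,
    3000000,
)


def bin_value(value):
    # bisect_left returns the index of the first edge that is >= value,
    # or len(EDGES) when the value exceeds every edge.
    return BUCKETS[bisect_left(EDGES, value)]
```

bisect_left is used rather than bisect_right so that a value exactly equal to a bound stays in the lower bin, matching the <= comparisons in the question.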
Disclaimer: though I don't think it matters for this particular information, I'm the current flake8 maintainer.