
Bin a continuous variable without getting C901 flake8 too complex


I know pandas and numpy have binning functions, such as pd.cut and np.digitize, but these are most useful for large arrays, lists, or dataframes. For my purposes they seem like overkill, since my project has just a single variable.

Right now I have a single continuous variable and I use the following function to bin it (make it discrete):

def bin_data(self):  # noqa: C901
    if self.value <= 300000:
        bin_cat = 999
    elif 300000 < self.value <= 500000:
        bin_cat = 15000
    elif 500000 < self.value <= 1000000:
        bin_cat = 30000
    elif 1000000 < self.value <= 2200000:
        bin_cat = 60000
    elif 2200000 < self.value <= 4400000:
        bin_cat = 120000
    elif 4400000 < self.value <= 8800000:
        bin_cat = 180000
    elif 8800000 < self.value <= 17500000:
        bin_cat = 300000
    elif 17500000 < self.value <= 35000000:
        bin_cat = 600000
    elif 35000000 < self.value <= 70000000:
        bin_cat = 900000
    elif 70000000 < self.value <= 140000000:
        bin_cat = 1500000
    else:
        bin_cat = 3000000
    return bin_cat

But this results in a flake8 error C901: function too complex.

Two questions:

  1. What is wrong with code like this? I don't find it "complex".
  2. How would we make this "easier"?
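
On the first question, it may help to see roughly what the checker counts. The sketch below is only an approximation that uses the standard ast module: it adds 1 to the number of if/elif tests, loops, and except handlers. The name approx_complexity and the BRANCH_NODES tuple are made up for illustration, and the real mccabe plugin behind C901 builds a path graph rather than counting nodes, so treat the numbers as indicative only.

```python
import ast
import textwrap

# Node types treated as decision points in this approximation (an assumption;
# the real mccabe plugin builds a path graph instead of counting nodes).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler)


def approx_complexity(source):
    """Return 1 plus the number of branch points found in the source."""
    tree = ast.parse(textwrap.dedent(source))
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))


# A shortened if/elif chain in the style of the function above.
CHAIN = """
def bin_data(value):
    if value <= 300000:
        return 999
    elif value <= 500000:
        return 15000
    elif value <= 1000000:
        return 30000
    else:
        return 60000
"""

# A loop over (upper bound, bucket) pairs; BINS and OTHERWISE are only
# parsed here, never evaluated, so they do not need to exist.
LOOP = """
def bin_data(value):
    for max_n, bucket in BINS:
        if value <= max_n:
            return bucket
    return OTHERWISE
"""

print(approx_complexity(CHAIN))  # three if/elif tests, plus 1
print(approx_complexity(LOOP))   # one for and one if, plus 1
```

Under this counting, every extra elif adds one to the score, so a chain with ten tests scores 11 no matter how simple each test looks, while the loop version stays at 3 however many bins there are.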

Solution

  • Not saying this code is complex, but you may be able to write it in a less error-prone, less complex way with a loop over some (end, result) tuples. Note also that flake8 does not check complexity by default; you have to opt into that behaviour by setting a threshold with the max-complexity option.

    Since it didn't fit in the comments section, here's an alternative way to write this that satisfies the complexity check:

    BINS = (
        (300000, 999),
        (500000, 15000),
        (1000000, 30000),
        (2200000, 60000),
        (4400000, 120000),
        (8800000, 180000),
        (17500000, 300000),
        (35000000, 600000),
        (70000000, 900000),
        (140000000, 1500000),
    )
    OTHERWISE = 3000000
    
    
    def bin_data(value):
        for max_n, bucket in BINS:
            if value <= max_n:
                return bucket
        return OTHERWISE
    

    disclaimer: though I don't think it matters for this particular information, I'm the current flake8 maintainer
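
Another possible variation, for anyone who would rather avoid the explicit loop: because the upper bounds are sorted, the standard library's bisect module can find the bin directly. This is a sketch, not part of the answer above. The names EDGES and BUCKETS are made up here, and the values are copied from the if/elif chain in the question.

```python
from bisect import bisect_left

# Inclusive upper bound of each bin, in ascending order.
EDGES = (
    300000, 500000, 1000000, 2200000, 4400000,
    8800000, 17500000, 35000000, 70000000, 140000000,
)
# One more bucket than edges: the last entry covers values above the final edge.
BUCKETS = (
    999, 15000, 30000, 60000, 120000,
    180000, 300000, 600000, 900000, 1500000, 3000000,
)


def bin_data(value):
    # bisect_left returns the index of the first edge >= value, which matches
    # the "value <= upper bound" tests in the original chain.
    return BUCKETS[bisect_left(EDGES, value)]


print(bin_data(300000))     # lands in the first bin (upper bound is inclusive)
print(bin_data(300001))     # lands in the second bin
print(bin_data(200000000))  # above every edge, so the fallback bucket
```

bisect_left rather than bisect_right is what keeps each upper bound inclusive; with bisect_right a value of exactly 300000 would fall into the next bin.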