Search code examples
pythonpandasnumpybinning

Bin variable with pre-defined bins and closed/open intervals


I have a set of bins which can be defined by:

  1. A set of tuples of non-overlaping boundaries of each bin:

    Intervals: [(0,1), (1,2), (3,4)]

  2. A set of indicators identifying which boundary of each tuple is closed:

    Closed_Boundaries: ['right','right','both']

  3. A set of labels for each interval

    Labels: ['first','second','third']

I am looking for an efficient, elegant and scalable way of applying this binning to a numeric series in a pandas dataframe so that the result includes the respective labels as identified by the binning logic:

Data_input: [3.5, 1, 0.5, 3]

Data_result: ['third', 'first', 'first', 'third']

I tried with pandas.IntervalIndex.from_tuples() followed by pandas.cut(). However, the labels argument of pandas.cut() is disabled when using IntervalIndex.from_tuples() while the name argument of the latter doesn't allow me to set labels to be used as replacement values.

PS: The pandas problem with labels not supported with IntervalIndex has been discussed here.


Solution

  • If all intervals are closed on the same side

    The simplest way is to use pd.cut. However, there is an outstanding bug where it ignores labels when bins is IntervalIndex.

    def cut(array, bins, labels, closed='right'):
        _bins = pd.IntervalIndex.from_tuples(bins, closed=closed)
    
        x = pd.cut(array, _bins)
        x.categories = labels # workaround for the bug
        return x
    
    array = [3.5, 1, 0.5, 3]
    bins = [(0,1), (1,2), (3,4)]
    labels = ['first', 'second', 'third']
    
    df = pd.DataFrame({
        'value': array,
        'category': cut(array, bins, labels, closed='right')
    })
    

    Output:

       value category
    0    3.5    third
    1    1.0    first
    2    0.5    first
    3    3.0      NaN
    

    If every interval is different

    Things get a lot slower because the code is not vectorized but it's conceptually simple: for each item in the array, find the first bin that it falls into and add the label of that bin.

    def cut(array, bins, labels):
        intervals = [pd.Interval(*b) for b in bins]
    
        categories = []
        for value in array:
            cat = None
            for i, interval in enumerate(intervals):
                if value in interval:
                    cat = labels[i]
                    break
            categories.append(cat)
    
        return categories
    
    cut([3.5, 1, 0.5, 3], bins=[(0,1,'right'),(1,2,'right'),(3,4,'left')], labels=['first', 'second', 'third'])
    

    I modified the bin tuples to include what side they are closed on. Options are left, right, both and neither.