Search code examples
pythonpandasdataframebinning

Output both bins and labels column in pandas binning


I have a dataframe column on which I would like to perform binning, for example:

df.head
X
4.6
2.5
3.1
1.7

I want one column for the bin range and one column for the label, as follows:

df.head
X bin label
4.6 (4,5] 5
2.5 (2,3] 3
3.1 (3,4] 4
1.7 (1,2] 2

Apparently, setting the label parameter as follows would just result in a column for bin labels, but not for the range anymore.

df['bin'] = df.X.apply(pd.cut, labels=np.arange(5))

Is there a more elegant solution to this instead of running pd.cut 2 times for the 2 columns?

Thanks


Solution

  • If you're allowing pd.cut to set the bin edges dynamically, you can use the retbins flag. From the pd.cut documentation:

    retbins: bool, default False
        Whether to return the bins or not. Useful when bins is provided as a scalar.
    

    This will return a second result:

    bins: numpy.ndarray or IntervalIndex.
        The computed or specified bins. Only returned when
        retbins=True. For scalar or sequence bins, this is
        an ndarray with the computed bins. If set
        duplicates=drop, bins will drop non-unique bin. For
        an IntervalIndex bins, this is equal to bins.
    

    You can use this to assign the bin edges to the frame:

    assignments, edges = pd.cut(df.X, bins=5, labels=False, retbins=True)
    df['label'] = assignments
    df['bin_floor'] = edges[assignments]
    df['bin_ceil'] = edges[assignments + 1]
    

    Your comments indicate that you'd like to use this within a groupby operation. In that case, you can wrap the above in a function:

    def assign_dynamic_bin_ids_and_labels(
        df,
        value_col,
        nbins,
        label_col='label',
        bin_floor_col='bin_floor',
        bin_ceil_col='bin_ceil',
    ):
        assignments, edges = pd.cut(
            df[value_col], bins=5, labels=False, retbins=True
        )
    
        df[label_col] = assignments
        df[bin_floor_col] = edges[assignments]
        df[bin_ceil_col] = edges[assignments + 1]
    
        return df
    
    df.groupby('id').apply(assign_dynamic_bin_ids_and_labels, 'X', 5)