Tags: python, pandas, binning

Symmetric number of bins in qcut around zero


I have a pandas DataFrame with a different number of integers and NaNs in each row. I would like to allocate the values in each row into 8 bins: 4 bins for negative values and 4 bins for positive values. The number of values in each bin will therefore differ from row to row. Any hints on how to adapt the qcut function for that? Thanks!


Solution

  • If I understand correctly, you could just do one qcut on the positive values and a separate qcut on the negative values.

    For example, given the dataframe:

    >>> df
            vals
    0  -0.456460
    1   0.448368
    2   0.186750
    3   1.056617
    4  -0.035620
    5  -0.609843
    6   0.126376
    7   0.160817
    8  -1.495441
    9   0.730763
    10 -0.005071
    11  0.677918
    12 -0.779553
    13  0.717374
    14  2.250258
    15 -0.801028
    16  0.306408
    17  0.538970
    18 -2.120528
    19  1.066903
    

    Use 2 qcuts, one for positive and one for negative.

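    # Quartile-bin the positive values only
    df.loc[df.vals > 0, 'bin'] = pd.qcut(df.loc[df.vals > 0, 'vals'], q=4)
    
    # Quartile-bin the negative values separately (zeros and NaNs match neither mask and stay unbinned)
    df.loc[df.vals < 0, 'bin'] = pd.qcut(df.loc[df.vals < 0, 'vals'], q=4)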
    

    As a result, the values fall into 8 distinct bins, 4 for positive and 4 for negative:

    >>> df
            vals                 bin
    0  -0.456460    (-0.695, -0.351]
    1   0.448368      (0.276, 0.608]
    2   0.186750      (0.125, 0.276]
    3   1.056617       (0.812, 2.25]
    4  -0.035620  (-0.351, -0.00507]
    5  -0.609843    (-0.695, -0.351]
    6   0.126376      (0.125, 0.276]
    7   0.160817      (0.125, 0.276]
    8  -1.495441    (-2.122, -0.975]
    9   0.730763      (0.608, 0.812]
    10 -0.005071  (-0.351, -0.00507]
    11  0.677918      (0.608, 0.812]
    12 -0.779553    (-0.975, -0.695]
    13  0.717374      (0.608, 0.812]
    14  2.250258       (0.812, 2.25]
    15 -0.801028    (-0.975, -0.695]
    16  0.306408      (0.276, 0.608]
    17  0.538970      (0.276, 0.608]
    18 -2.120528    (-2.122, -0.975]
    19  1.066903       (0.812, 2.25]
    

    Sorting the unique bins makes it easy to see the 4 bins for negative values followed by the 4 bins for positive values:

    np.sort(df['bin'].unique())
    
    array([Interval(-2.1219999999999999, -0.97499999999999998, closed='right'),
           Interval(-0.97499999999999998, -0.69499999999999995, closed='right'),
           Interval(-0.69499999999999995, -0.35099999999999998, closed='right'),
           Interval(-0.35099999999999998, -0.0050699999999999999, closed='right'),
           Interval(0.125, 0.27600000000000002, closed='right'),
           Interval(0.27600000000000002, 0.60799999999999998, closed='right'),
           Interval(0.60799999999999998, 0.81200000000000006, closed='right'),
           Interval(0.81200000000000006, 2.25, closed='right')], dtype=object)
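
The question asks for binning per row, while the example above bins a single column. The following is a minimal sketch of how the same two-qcut idea might be applied row by row. The helper name `bin_row`, the integer labelling scheme (-4..-1 for negatives, 1..4 for positives) and the sample data are assumptions made for illustration and are not part of the original answer. It relies on `labels=False`, which makes `pd.qcut` return integer codes instead of intervals.

```python
import numpy as np
import pandas as pd


def bin_row(row: pd.Series, q: int = 4) -> pd.Series:
    """Label negatives -q..-1 and positives 1..q; zeros and NaNs stay NaN.

    Hypothetical helper: one qcut per sign, as in the answer above.
    """
    labels = pd.Series(np.nan, index=row.index, dtype="float64")
    for mask, offset in ((row < 0, -q), (row > 0, 1)):
        if mask.sum() < 2:
            # A single value yields identical quantile edges, which qcut rejects
            continue
        # labels=False returns integer codes 0..q-1 instead of Interval objects
        codes = pd.qcut(row[mask], q=q, labels=False).to_numpy()
        labels[mask] = codes + offset
    return labels


# Made-up sample data: each row has a different mix of integers, NaNs and a zero
df = pd.DataFrame(
    [
        [-8, -5, -3, -1, 1, 3, 5, 8, np.nan, np.nan],
        [np.nan, -6, -2, 0, 4, 10, np.nan, 7, 1, -9],
    ],
    columns=list("abcdefghij"),
)

# axis=1 passes each row to bin_row as a Series indexed by the column names
binned = df.apply(bin_row, axis=1)
print(binned)
```

The same helper can be called on a single column, for example `bin_row(df['vals'])`. Note that `pd.qcut` raises a `ValueError` when quantile edges coincide, which can happen with many repeated integers. Passing `duplicates='drop'` avoids the error but produces fewer than `q` bins, so the offsets in this sketch would then need adjusting.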