Search code examples
python-3.xpandasdataframebinning

Dynamically create bins from Pandas column such that No. of observations or count should be 1% of total count


I have pandas DataFrame of numeric columns. Currently, I'm working on one numeric column.

Total rows 34149
Mean 4112195398.901
Min -1687500000
Median 44531800
Max 56956420100000

This data does not contain any zeros(0)

I want to create integer bins as follows

 1. -1 for all negative values 
 2. Bin width 1000000 upto 20000000
 3. After 20000000, dynamically size bins to contain 1% of the total count or NOBS(No. of Observations) 
i.e. if NOBS is going below 1% of total NOBS the dynamically adjust the bin width such that it will be approx 1% of total NOBS.

I'm able to create integer bins upto the second condition mentioned above. Stuck at last condition.

Here is my code

df_ = df['Col1'].dropna()
df_1 = pd.Series(np.where(df_<0, -1, pd.Series(df_/1000000).astype(int)))

df_2 = pd.DataFrame(df_1, columns=['Bins'])
df_bins = df_2.value_counts(sort=False).reset_index(name='NOBS')
df_bins['Percentage'] = (df_bins['NOBS']/df_bins['NOBS'].sum())*100

The output of df_bins is

    Bins NOBS Percentage
   -1    195   0.943
    0   2155  10.420
    1   1594   7.707
    2   1209   5.846
    3    970   4.690
    .
    .
    .
    .
    .
    .
    .

The total of Percentage column will be 100% always. All bins will be integers.


Solution

  • What you expect is not clear but I think you can use pd.cut:

    import pandas as pd
    import numpy as np
    
    # Reproducible example
    rng = np.random.default_rng(42)
    a = rng.integers(-1687500000, 0, 5000)
    b = rng.integers(0, 20000000, 25000)
    c = rng.integers(20000000, 56956420100000, 4149)
    df = pd.DataFrame({'Col1': np.concatenate([a, b, c])})
    
    # Create bins
    m = df['Col1'] > 20000000
    q = np.linspace(0, 1, 100 * sum(m) // len(df))
    bins = [-np.inf] + np.arange(0, 20000000+1, 1000000).tolist()
    bins += df.loc[m, 'Col1'].quantile(q).tolist()[1:]
    
    out = (pd.cut(df['Col1'].dropna(), bins, labels=False).sub(1)
             .value_counts().rename('NOBS')
             .sort_index().rename_axis('Bins').reset_index()
             .assign(Percentage=lambda x: x['NOBS'] / len(df) * 100))
    

    Output:

    >>> out
        Bins  NOBS  Percentage
    0     -1  5000   14.641717
    1      0  1274    3.730710
    2      1  1169    3.423233
    3      2  1287    3.768778
    4      3  1241    3.634074
    5      4  1296    3.795133
    6      5  1246    3.648716
    7      6  1288    3.771706
    8      7  1201    3.516940
    9      8  1289    3.774635
    10     9  1174    3.437875
    11    10  1273    3.727781
    12    11  1246    3.648716
    13    12  1277    3.739495
    14    13  1292    3.783420
    15    14  1259    3.686784
    16    15  1276    3.736566
    17    16  1200    3.514012
    18    17  1242    3.637003
    19    18  1204    3.525725
    20    19  1266    3.707283
    21    20   378    1.106914
    22    21   377    1.103985
    23    22   377    1.103985
    24    23   377    1.103985
    25    24   377    1.103985
    26    25   377    1.103985
    27    26   377    1.103985
    28    27   377    1.103985
    29    28   377    1.103985
    30    29   377    1.103985
    31    30   378    1.106914