I have pandas DataFrame of numeric columns. Currently, I'm working on one numeric column.
Total rows 34149
Mean 4112195398.901
Min -1687500000
Median 44531800
Max 56956420100000
This data does not contain any zeros(0)
I want to create integer bins as follows
1. -1 for all negative values
2. Bin width 1000000 upto 20000000
3. After 20000000, dynamically size bins to contain 1% of the total count or NOBS(No. of Observations)
i.e. if NOBS is going below 1% of total NOBS the dynamically adjust the bin width such that it will be approx 1% of total NOBS.
I'm able to create integer bins upto the second condition mentioned above. Stuck at last condition.
Here is my code
df_ = df['Col1'].dropna()
df_1 = pd.Series(np.where(df_<0, -1, pd.Series(df_/1000000).astype(int)))
df_2 = pd.DataFrame(df_1, columns=['Bins'])
df_bins = df_2.value_counts(sort=False).reset_index(name='NOBS')
df_bins['Percentage'] = (df_bins['NOBS']/df_bins['NOBS'].sum())*100
The output of df_bins
is
Bins NOBS Percentage
-1 195 0.943
0 2155 10.420
1 1594 7.707
2 1209 5.846
3 970 4.690
.
.
.
.
.
.
.
The total of Percentage
column will be 100% always.
All bins will be integers.
What you expect is not clear but I think you can use pd.cut
:
import pandas as pd
import numpy as np
# Reproducible example
rng = np.random.default_rng(42)
a = rng.integers(-1687500000, 0, 5000)
b = rng.integers(0, 20000000, 25000)
c = rng.integers(20000000, 56956420100000, 4149)
df = pd.DataFrame({'Col1': np.concatenate([a, b, c])})
# Create bins
m = df['Col1'] > 20000000
q = np.linspace(0, 1, 100 * sum(m) // len(df))
bins = [-np.inf] + np.arange(0, 20000000+1, 1000000).tolist()
bins += df.loc[m, 'Col1'].quantile(q).tolist()[1:]
out = (pd.cut(df['Col1'].dropna(), bins, labels=False).sub(1)
.value_counts().rename('NOBS')
.sort_index().rename_axis('Bins').reset_index()
.assign(Percentage=lambda x: x['NOBS'] / len(df) * 100))
Output:
>>> out
Bins NOBS Percentage
0 -1 5000 14.641717
1 0 1274 3.730710
2 1 1169 3.423233
3 2 1287 3.768778
4 3 1241 3.634074
5 4 1296 3.795133
6 5 1246 3.648716
7 6 1288 3.771706
8 7 1201 3.516940
9 8 1289 3.774635
10 9 1174 3.437875
11 10 1273 3.727781
12 11 1246 3.648716
13 12 1277 3.739495
14 13 1292 3.783420
15 14 1259 3.686784
16 15 1276 3.736566
17 16 1200 3.514012
18 17 1242 3.637003
19 18 1204 3.525725
20 19 1266 3.707283
21 20 378 1.106914
22 21 377 1.103985
23 22 377 1.103985
24 23 377 1.103985
25 24 377 1.103985
26 25 377 1.103985
27 26 377 1.103985
28 27 377 1.103985
29 28 377 1.103985
30 29 377 1.103985
31 30 378 1.106914