I have a table of float values and strings such as the following:
FltVal | Category |
---|---|
0.01 | A |
0.02 | A |
0.05 | B |
0.31 | B |
0.36 | A |
0.37 | C |
0.41 | B |
I would like to produce a new dataframe that bins the float values into overlapping bins of size 0.3 (so each bin is (x, x + 0.3]), with the bin edges stepping in 0.01 increments. Additionally, I'd like a column tallying the number of records falling into each bin, as well as a column tallying each 'Category' total for the bin.
I first tried to get a basic count going with groupby and pd.cut (before trying to find counts for each Category):
import pandas as pd
import numpy as np

floats = pd.Series([0.01, 0.02, 0.05, 0.31, 0.36, 0.37, 0.41])
categories = pd.Series(['A', 'A', 'B', 'B', 'A', 'C', 'B'])
data = {"FltVal": floats, "Category": categories}
df = pd.concat(data, axis=1)
grouped_vals = df.groupby(pd.cut(df['FltVal'], np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.3))).count()
Output:
FltVal Category
FltVal
(0.01, 0.31] 3 3
I think the problem there was that np.arange stops before the max, so only a single bin got generated, never mind overlapping ones.
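Checking the edges on the sample data seems to confirm that:

np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.3)
# array([0.01, 0.31]) -> just two edges, so pd.cut can only build the single bin (0.01, 0.31]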
Then, with the idea that I could just throw out nonsensical ranges (like negative ones) afterward, I tried the following:
FltVal_range = np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.01)
FltVal_range_from = FltVal_range - 0.3
FltVal_range_to = FltVal_range
FltVal_intervals = pd.IntervalIndex.from_arrays(FltVal_range_from, FltVal_range_to)
binned_df = df.groupby(pd.cut(df['FltVal'], FltVal_intervals))
But this raised a ValueError:
Overlapping IntervalIndex is not accepted.
If pd.cut doesn't accept overlapping bins, I guess this approach is a non-starter.
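That restriction makes sense in hindsight: with overlapping intervals a single value belongs to many bins at once, so pd.cut can't assign one bin label per row. A quick check with two of the desired overlapping bins:

iv = pd.IntervalIndex.from_arrays([0.00, 0.01], [0.30, 0.31])
[0.05 in interval for interval in iv]
# [True, True] -> 0.05 falls in both (0.00, 0.30] and (0.01, 0.31]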
Regardless, the desired output is the following:
FloatBin | Count | Cat_A | Cat_B | Cat_C |
---|---|---|---|---|
0.00, 0.30 | 3 | 2 | 1 | 0 |
0.01, 0.31 | 3 | 1 | 2 | 0 |
0.02, 0.32 | 2 | 0 | 2 | 0 |
0.03, 0.33 | 2 | 0 | 2 | 0 |
0.04, 0.34 | 2 | 0 | 2 | 0 |
0.05, 0.35 | 1 | 0 | 1 | 0 |
0.06, 0.36 | 2 | 1 | 1 | 0 |
0.07, 0.37 | 3 | 1 | 1 | 1 |
0.08, 0.38 | 3 | 1 | 1 | 1 |
0.09, 0.39 | 3 | 1 | 1 | 1 |
0.10, 0.40 | 3 | 1 | 1 | 1 |
0.11, 0.41 | 4 | 1 | 2 | 1 |
An efficient approach is necessary, as the actual dataframe will have hundreds of thousands to millions of rows.
I'm thinking of something similar to this answer but suitable for finding counts for all my categories.
All help is much appreciated!
One option using janitor's conditional_join:
import janitor

FltVal_range = np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.01)

# Per-value category counts, plus a 'Count' margin column holding each row's total
tmp = (pd.crosstab(df['FltVal'], df['Category'],
                   margins=True, margins_name='Count')
       .drop('Count')  # drop the margin row, keep the margin column
       )

# Pair every value with each (low, high] window that contains it,
# then sum the per-value counts within each window
out = (pd.DataFrame({'low': FltVal_range, 'high': FltVal_range + 0.3})
       .conditional_join(tmp.reset_index(),
                         ('low', 'FltVal', '<'),
                         ('high', 'FltVal', '>='))
       .groupby(['low', 'high'])[list(tmp)].sum()
       )
Output:
A B C Count
low high
0.01 0.31 1 2 0 3
0.02 0.32 0 2 0 2
0.03 0.33 0 2 0 2
0.04 0.34 0 2 0 2
0.05 0.35 0 1 0 1
0.06 0.36 1 1 0 2
0.07 0.37 1 1 1 3
0.08 0.38 1 1 1 3
0.09 0.39 1 1 1 3
0.10 0.40 1 1 1 3
0.11 0.41 1 2 1 4
0.12 0.42 1 2 1 4
0.13 0.43 1 2 1 4
0.14 0.44 1 2 1 4
0.15 0.45 1 2 1 4
...
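If you'd rather avoid the extra dependency, here is a minimal pandas/NumPy-only sketch of the same interval-count idea, using np.searchsorted on per-category sorted values (the Cat_ column names and the rounding to the 0.01 grid are my choices, not anything from the question):

import numpy as np
import pandas as pd

# Bin edges on a 0.01 grid; rounding guards against float drift in np.arange
lows = np.round(np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.01), 2)
highs = np.round(lows + 0.3, 2)

out = pd.DataFrame({'low': lows, 'high': highs})
for cat, grp in df.groupby('Category'):
    vals = np.sort(grp['FltVal'].to_numpy())
    # count of values in (low, high] = rank of high (right) - rank of low (right)
    out[f'Cat_{cat}'] = (np.searchsorted(vals, highs, side='right')
                         - np.searchsorted(vals, lows, side='right'))

cat_cols = [c for c in out.columns if c.startswith('Cat_')]
out['Count'] = out[cat_cols].sum(axis=1)

After a one-off sort per category, each bin costs two binary searches, so this should stay tractable at the hundreds-of-thousands-to-millions scale mentioned in the question.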