I have a dataframe that is made of two columns, for example:
x | y |
---|---|
8492 | 119748 |
10581 | 106092 |
10409 | 114885 |
9812 | 162303 |
9676 | 105570 |
10162 | 66298 |
9042 | 50838 |
8646 | 114050 |
9531 | 117002 |
9151 | 90112 |
I would like to generate a histogram of the items in column x with bin size 500 (for example) but have another column that gives me the average of those values in column y for that bin.
So the result table would look something like:
Count x | Avg y | |
---|---|---|
8000 | 0 | 0 |
8500 | 1 | 119748 |
9000 | 1 | 114050 |
9500 | 2 | 70475 |
10000 | 3 | 128292 |
10500 | 2 | 90592 |
11000 | 1 | 106092 |
In this case I am showing the average, but I would like to have an option of collecting the min y, max y, median y.
I can write a function in Python that will do this, but wondering if there is any native way to do this with Pandas.
Thanks!
You basically want an aggregation with grouping over bins (or in pandas "cuts").
import pandas as pd
import numpy as np
df = pd.DataFrame({
"x" : [8492, 10581, 10409, 9812, 9676, 10162, 9042, 8646, 9531, 9151],
"y" : [119748, 106092, 114885, 162303, 105570, 66298, 50838, 114050, 117002, 90112]
})
The binning limits could be more thought through, but it does the job.
bins_size = 500
bins = np.arange(
(df.x.min()// bins_size) * bins_size,
(((df.x.max()// bins_size) + 1) * bins_size) + 1,
bins_size)
You can add whatever you want to the agg()
in the same manner such as "min"
, "max"
or even lambda functions (without quotes of course). Regarding "count"
it would not matter whether you pick x or y as long as neither has a NULL value within the group.
print(df
.assign(bined = lambda var : pd.cut(var.x, bins))
.groupby(["bined"])
.agg(
mean_y = ("y", "mean"),
median_y = ("y", "median"),
count = ("x", "count"))
.reset_index()
)
This will result in the table below:
bined mean_y median_y count
0 (8000, 8500] 119748.000000 119748.0 1
1 (8500, 9000] 114050.000000 114050.0 1
2 (9000, 9500] 70475.000000 70475.0 2
3 (9500, 10000] 128291.666667 117002.0 3
4 (10000, 10500] 90591.500000 90591.5 2
5 (10500, 11000] 106092.000000 106092.0 1