pandas - two-dimensional histogram and averages

I have a dataframe that is made of two columns, for example:

x	y
8492	119748
10581	106092
10409	114885
9812	162303
9676	105570
10162	66298
9042	50838
8646	114050
9531	117002
9151	90112

I would like to generate a histogram of the items in column x with bin size 500 (for example) but have another column that gives me the average of those values in column y for that bin.

So the result table would look something like:

	Count x	Avg y
8000	0	0
8500	1	119748
9000	1	114050
9500	2	70475
10000	3	128292
10500	2	90592
11000	1	106092

In this case I am showing the average, but I would like to have an option of collecting the min y, max y, median y.

I can write a function in Python that will do this, but wondering if there is any native way to do this with Pandas.

Thanks!

Solution

You basically want an aggregation with grouping over bins (or in pandas "cuts").

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "x" : [8492, 10581, 10409, 9812, 9676, 10162, 9042, 8646, 9531, 9151],
    "y" : [119748, 106092, 114885, 162303, 105570, 66298, 50838, 114050, 117002, 90112]
    })

The binning limits could be more thought through, but it does the job.

bins_size = 500
bins = np.arange(
    (df.x.min()// bins_size) * bins_size, 
    (((df.x.max()// bins_size) + 1) * bins_size) + 1, 
    bins_size)

You can add whatever you want to the agg() in the same manner such as "min", "max" or even lambda functions (without quotes of course). Regarding "count" it would not matter whether you pick x or y as long as neither has a NULL value within the group.

print(df
    .assign(bined = lambda var : pd.cut(var.x, bins))
    .groupby(["bined"])
    .agg(
        mean_y = ("y", "mean"),
        median_y = ("y", "median"),
        count = ("x", "count"))
    .reset_index()
    )

This will result in the table below:

            bined         mean_y  median_y  count
0    (8000, 8500]  119748.000000  119748.0      1
1    (8500, 9000]  114050.000000  114050.0      1
2    (9000, 9500]   70475.000000   70475.0      2
3   (9500, 10000]  128291.666667  117002.0      3
4  (10000, 10500]   90591.500000   90591.5      2
5  (10500, 11000]  106092.000000  106092.0      1