Search code examples
pythonpandashistogram

pandas - two-dimensional histogram and averages


I have a dataframe that is made of two columns, for example:

x y
8492 119748
10581 106092
10409 114885
9812 162303
9676 105570
10162 66298
9042 50838
8646 114050
9531 117002
9151 90112

I would like to generate a histogram of the items in column x with bin size 500 (for example) but have another column that gives me the average of those values in column y for that bin.

So the result table would look something like:

Count x Avg y
8000 0 0
8500 1 119748
9000 1 114050
9500 2 70475
10000 3 128292
10500 2 90592
11000 1 106092

In this case I am showing the average, but I would like to have an option of collecting the min y, max y, median y.

I can write a function in Python that will do this, but wondering if there is any native way to do this with Pandas.

Thanks!


Solution

  • You basically want an aggregation with grouping over bins (or in pandas "cuts").

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({
        "x" : [8492, 10581, 10409, 9812, 9676, 10162, 9042, 8646, 9531, 9151],
        "y" : [119748, 106092, 114885, 162303, 105570, 66298, 50838, 114050, 117002, 90112]
        })
    

    The binning limits could be more thought through, but it does the job.

    bins_size = 500
    bins = np.arange(
        (df.x.min()// bins_size) * bins_size, 
        (((df.x.max()// bins_size) + 1) * bins_size) + 1, 
        bins_size)
    

    You can add whatever you want to the agg() in the same manner such as "min", "max" or even lambda functions (without quotes of course). Regarding "count" it would not matter whether you pick x or y as long as neither has a NULL value within the group.

    print(df
        .assign(bined = lambda var : pd.cut(var.x, bins))
        .groupby(["bined"])
        .agg(
            mean_y = ("y", "mean"),
            median_y = ("y", "median"),
            count = ("x", "count"))
        .reset_index()
        )
    

    This will result in the table below:

                bined         mean_y  median_y  count
    0    (8000, 8500]  119748.000000  119748.0      1
    1    (8500, 9000]  114050.000000  114050.0      1
    2    (9000, 9500]   70475.000000   70475.0      2
    3   (9500, 10000]  128291.666667  117002.0      3
    4  (10000, 10500]   90591.500000   90591.5      2
    5  (10500, 11000]  106092.000000  106092.0      1