Search code examples
pythonpandasgroup-byhistogramdensity-plot

bin value of histograms from grouped data


I am a beginner in Python and I am making separate histograms of travel distance per departure hour. Data I'm using, about 2500 rows of this. Distance is float64, the Departuretime is str. However, for making further calculations I'd like to have the value of each bin in a histogram, for all histograms.

Up until now, I have the following:

    df['Distance'].hist(by=df['Departuretime'], color = 'red', 
            edgecolor = 'black',figsize=(15,15),sharex=True,density=True)

This creates in my case a figure with 21 small histograms. Histogram output I'm receiving. Of all these histograms I want to know the y-axis value of each bar, preferably in a dataframe with the distance binning as rows and the hours as columns.

With single histograms, I'd paste counts, bins, bars = in front of the entire line and the variable counts would contain the data I was looking for, however, in this case it does not work.

Ideally I'd like a dataframe or list of some sort for each histogram, containing the density values of the bins. I hope someone can help me out! Big thanks in advance!


Solution

  • First of all, note that the bins used in the different histograms that you are generating don't have the same edges (you can see this since you are using sharex=True and the resulting bars don't have the same width), in all cases you are getting 10 bins (the default), but they are not the same 10 bins. This makes it impossible to combine them all in a single table in any meaningful way. You could provide a fixed list of bin edges as the bins parameter to standarize this.

    Alternatively, I suggest you calculate a new column that describes to which bin each row belongs, this way we are also unifying the bins calulation. You can do this with the cut function, which also gives you the same freedom to choose the number of bins or the specific bin edges the same way as with hist.

    df['DistanceBin'] = pd.cut(df['Distance'], bins=10)
    

    Then, you can use pivot_table to obtain a table with the counts for each combination of DistanceBin and Departuretime as rows and columns respectively as you asked.

    df.pivot_table(index='DistanceBin', columns='Departuretime', aggfunc='count')