
How to scale y-axis for histogram pandas plot?


I have a full year of data sampled every fifteen minutes, and I want to create a histogram whose counts are in hours rather than in fifteen-minute samples.

Toy example code

I have the following toy example code:

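import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()  # apply the seaborn theme before plotting so it affects this figure
df = pd.read_csv(r"D:/tmp/load.csv")
df.hist(bins=range(20, 80, 5))  # every fifteen-minute sample adds 1 to its bin
plt.xlabel("Value")
plt.ylabel("count")
plt.show()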

This produces the following graph:

[Image: histogram]

The data within the DataFrame is of the form:

>>> df[(df["Time"] > "2021-04-10 19:45:00") & (df["Time"] < "2021-04-10 21:00:00")]
                     Time      tag
9584  2021-04-10 20:00:00  50.3840
9585  2021-04-10 20:15:00  37.8332
9586  2021-04-10 20:30:00  36.6808
9587  2021-04-10 20:45:00  37.1840

Expected result

I need to change the y-axis values of the histogram so that they show a count of hours rather than a count of fifteen-minute samples. For the first bar, for example, I should see 10 (40/4) instead of 40. In other words, the whole y-axis should be divided by 4.

Question

How can I scale the y-axis of the histogram? Should I use the plt.yticks function somehow?


Solution

  • Here is my take on your interesting question.

    I don't know of a way to rescale the y-axis after having plotted the dataframe, but you can rescale the dataframe itself.

    For instance, in the following toy dataframe, sampled every 15 minutes, 9 values fall between 35 and 40:

    • 4 values were measured between 20:00:00 and 20:59:00
    • 1 between 21:00:00 and 21:59:00
    • 3 between 22:00:00 and 22:59:00
    • 1 between 23:00:00 and 23:59:00

    import pandas as pd
    
    df = pd.DataFrame(
        {
            "index": [
                "2021-04-10 20:00:00",
                "2021-04-10 20:15:00",
                "2021-04-10 20:30:00",
                "2021-04-10 20:45:00",
                "2021-04-10 21:00:00",
                "2021-04-10 21:15:00",
                "2021-04-10 21:30:00",
                "2021-04-10 21:45:00",
                "2021-04-10 22:00:00",
                "2021-04-11 22:15:00",
                "2021-04-11 22:30:00",
                "2021-04-11 22:45:00",
                "2021-04-11 23:00:00",
                "2021-04-11 23:15:00",
                "2021-04-11 23:30:00",
                "2021-04-11 23:45:00",
            ],
            "tag": [39, 36, 36, 37, 42, 28, 39, 54, 43, 38, 39, 36, 44, 27, 38, 28],
        },
    )
    df["index"] = pd.to_datetime(df["index"], format="%Y-%m-%d %H:%M:%S")
    

    Here is the corresponding plot:

    df.copy().set_index("index").plot(
        kind="hist", bins=range(20, 80, 5), yticks=range(0, 10), grid=True
    )
    


    Had the measurements been taken hourly, only 4 values would have been found in the 35-40 bin:

    • 1 (and not 4) between 20:00:00 and 20:59:00
    • 1 between 21:00:00 and 21:59:00
    • 1 (and not 3) between 22:00:00 and 22:59:00
    • 1 between 23:00:00 and 23:59:00
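
    The counts in the two lists above can be checked without plotting. The following is a minimal sketch, not part of the original answer: it rebuilds the same toy data compactly with pd.date_range, and the names raw_counts and hourly_counts are illustrative. It counts, per bin, first every sample and then only one sample per distinct date and hour.

```python
import pandas as pd

# Same 16 timestamps and values as the toy dataframe, rebuilt compactly.
times = pd.date_range("2021-04-10 20:00", periods=9, freq="15min").append(
    pd.date_range("2021-04-11 22:15", periods=7, freq="15min")
)
df = pd.DataFrame(
    {
        "index": times,
        "tag": [39, 36, 36, 37, 42, 28, 39, 54, 43, 38, 39, 36, 44, 27, 38, 28],
    }
)

# Label each row with its bin, its date and its hour.
binned = df.assign(
    bin=pd.cut(df["tag"], bins=list(range(20, 60, 5))),
    date=df["index"].dt.date,
    hour=df["index"].dt.hour,
)

# One count per 15-minute sample.
raw_counts = binned.groupby("bin", observed=True).size()

# One count per distinct (bin, date, hour) combination.
hourly_counts = (
    binned.drop_duplicates(subset=["bin", "date", "hour"])
    .groupby("bin", observed=True)
    .size()
)

print(raw_counts)     # the 35-40 bin holds 9 samples
print(hourly_counts)  # the 35-40 bin holds 4 distinct hours
```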

    So, rescaling the dataframe to an hourly basis means:

    • assigning new columns for bins, dates and hours
    • sorting the values and dropping rows with the same bin, date and hour, keeping only the first duplicate row
    • cleaning up and plotting

    # Use the same bin edges for pd.cut and for the plot so both agree.
    bins = range(20, 80, 5)
    _ = (
        df.assign(
            # right=False gives [a, b) intervals, like matplotlib's histogram bins
            bin=pd.cut(df["tag"], bins=bins, right=False),
            date=df["index"].dt.date,
            hour=df["index"].dt.hour,
        )
        .sort_values(by=["bin", "date", "hour"])
        # keep a single row per bin for each date and hour
        .drop_duplicates(subset=["bin", "date", "hour"], keep="first")
        .drop(columns=["bin", "date", "hour"])
        .set_index("index")
        .plot(kind="hist", bins=bins, yticks=range(0, 5), grid=True)
    )
    

    This outputs a histogram in which the 35-40 bin has a count of 4 instead of 9.
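
    If a plain division by 4 is all that is needed, as the question asks, one possible alternative is to weight each sample by a quarter of an hour when building the histogram. This is a sketch under the assumption that every row represents exactly fifteen minutes; SAMPLES_PER_HOUR and plot_hours_histogram are hypothetical names introduced here. Note that it gives a different result from the approach above: it divides every bar by 4, whereas dropping duplicates counts the distinct hours in which a bin occurred.

```python
import numpy as np

# Toy values from the answer: 16 samples taken every 15 minutes.
tags = [39, 36, 36, 37, 42, 28, 39, 54, 43, 38, 39, 36, 44, 27, 38, 28]
bins = np.arange(20, 80, 5)

# Assumption: every sample stands for exactly a quarter of an hour.
SAMPLES_PER_HOUR = 4
weights = np.full(len(tags), 1 / SAMPLES_PER_HOUR)

# Each sample now adds 0.25 instead of 1, so the totals are in hours.
hours_per_bin, edges = np.histogram(tags, bins=bins, weights=weights)
print(dict(zip(edges[:-1].tolist(), hours_per_bin.tolist())))


def plot_hours_histogram(values, bins, samples_per_hour=SAMPLES_PER_HOUR):
    """Draw the same weighted histogram with matplotlib (hypothetical helper)."""
    # Imported here so the numeric part above runs without a display.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(values, bins=bins, weights=np.full(len(values), 1 / samples_per_hour))
    ax.set_xlabel("Value")
    ax.set_ylabel("hours")
    return ax
```

    Calling plot_hours_histogram(tags, bins) followed by plt.show() draws the bars. With the real data, df["tag"] would take the place of tags.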