
rrdtool graph ignoring --step?


I have RRD Files with multiple months of PDP Data (5min Interval).

For general-purpose Graphs it's fine to let rrdtool automatically decide which RRA to use for displaying the Graph.

But some of my Graphs contain 95th-Percentile Data in the legend, which I need to be calculated from "exact" 5min-Interval Data, because calculating a Percentile from aggregated Data-Points can (by its nature) lead to dramatically incorrect values.

  • I can fetch Data from RRD File with a step of 300 and I'll get the right data to calculate percentile on my own
  • PROBLEM: When graphing with a step of 300, the displayed Percentile value varies depending on the width of the Graph, even if the Time-Range is the same and 300s Data is available for the whole Time-Range
  • if the width of a 1-month graph is 800px, the shown Percentile (and other values, e.g. the maximum) is wrong
  • if the width of a 1-month graph is 8000px, the Values are correct (matching the values I calculated myself from the fetched data)

graph:

...
--step 300
...
"VDEF:perca=a,95,PERCENT",
...

created with:

        '-s', '300',
       ...
        "RRA:AVERAGE:0.5:1:53568",      # 6 months pdp
        "RRA:AVERAGE:0.5:12:8904",      # 1 hour, 1 year.
        "RRA:AVERAGE:0.5:288:730",      # 1 day, 2 years.
        "RRA:AVERAGE:0.5:2016:520",     # 1 week, 10 years.
        "RRA:MAX:0.5:1:600",            # 5 min: 2 days
        "RRA:MAX:0.5:12:8904",          # 1 hour, 1 year.
        "RRA:MAX:0.5:288:730",          # 1 day, 2 years.
        "RRA:MAX:0.5:2016:520",         # 1 week, 10 years

Solution

  • This is due to data consolidation being performed prior to the VDEF calculation.

    Although your rrdtool graph arguments specify a step of 300s, this is less than the time span covered by one pixel of the graph, so the data series is averaged further before it reaches the VDEF. All the CDEF and VDEF functions always work on a time series of one cdp per pixel. From the RRDTool manual:

    Note: a step smaller than one pixel will silently be ignored.

    This means that, while you can decrease the resolution of the data, you cannot increase it. Sadly, to get an accurate 95th Percentile, you need higher-resolution data.

    So, if you omit the --step 300 in a narrow graph, what will happen is:

    • You ask for a 1-month time window
    • RRDTool calculates 1 pixel is about 1 hour
    • The DEF retrieves an Average time series from the 1-hour RRA, one cdp per pixel (i.e. per hour)
    • VDEF then consolidates this to a 95th percentile
    • The 95th percentile calculation is inaccurate

    With --step 300 the process is slightly different, but the result is the same:

    • You ask for a 1-month time window, with step 300
    • RRDTool calculates 1 pixel is about 1 hour
    • The RRDTool DEF retrieves a month's worth of data from the 300s RRA
    • RRDTool further consolidates this data down to 1 cdp per pixel (i.e. per hour) using Average
    • VDEF then consolidates this to a 95th percentile
    • The 95th percentile calculation is inaccurate

    So, you can see the final outcome is the same. The only difference is where the 300s -> 1h consolidation happens: either in the RRA or at graph time.
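To see why averaging before the percentile step matters, here is a minimal sketch in Python. It is not RRDTool code: the traffic pattern is synthetic, and the nearest-rank percentile used here is an assumption that may differ in detail from how the VDEF PERCENT function picks its value.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: value at rank ceil(pct/100 * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def consolidate_average(values, factor):
    """Average each run of `factor` consecutive samples into one point."""
    return [
        sum(values[i:i + factor]) / factor
        for i in range(0, len(values) - factor + 1, factor)
    ]

# Synthetic data: 24 hours of 5-minute samples (12 per hour).
# Every hour holds one 5-minute burst of 100 and eleven quiet samples of 10.
raw = ([100.0] + [10.0] * 11) * 24

# What a 1-hour-per-pixel graph would work with after averaging.
hourly = consolidate_average(raw, 12)

p95_raw = percentile(raw, 95)
p95_hourly = percentile(hourly, 95)
print("95th percentile of 5-minute data:", p95_raw)
print("95th percentile of hourly averages:", p95_hourly)
```

Because the bursts make up more than 5% of the raw samples, the percentile of the 5-minute data lands on the burst value, while every hourly average is the same much lower number. The maximum is affected in the same way, which matches the wrong max values seen on the narrow graph.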

    When using a wide graph, the time per pixel becomes smaller, and RRDTool then no longer needs to perform its additional consolidation of the data, resulting in a more accurate calculation:

    • You ask for a 1-month time window
    • RRDTool calculates 1 pixel is about 5 minutes
    • The RRDTool DEF retrieves a month's worth of data from the 300s RRA
    • No further consolidation is required
    • VDEF then consolidates this to a 95th percentile
    • The 95th percentile calculation is accurate!
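The "time per pixel" figures in the lists above are plain arithmetic, sketched below. The 31-day month is an assumption, and the sketch only divides the time range by the width; how RRDTool rounds this to the step it actually uses is not modelled here.

```python
SECONDS_PER_DAY = 86400
BASE_STEP = 300  # the 5-minute step of the RRD in the question

def seconds_per_pixel(days, width_px):
    """Time span covered by one horizontal pixel of the graph."""
    return days * SECONDS_PER_DAY / width_px

narrow = seconds_per_pixel(31, 800)    # 1-month graph, 800px wide
wide = seconds_per_pixel(31, 8000)     # 1-month graph, 8000px wide

print(f"800px:  {narrow:.0f}s per pixel, about {narrow / BASE_STEP:.1f} base steps")
print(f"8000px: {wide:.0f}s per pixel, about {wide / BASE_STEP:.1f} base steps")
```

At 800px each pixel spans close to an hour, so roughly eleven 5-minute points are averaged into one; at 8000px a pixel is close to a single 5-minute step.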

    When you retrieve the raw data using rrdtool fetch, this extra consolidation doesn't happen, so you get:

    • You ask for a 1-month time window with step 300
    • RRDTool retrieves a month's worth of data from the 300s RRA
    • These data are output
    • Your spreadsheet then calculates a 95th percentile
    • The 95th percentile calculation is correct (well, as close as you can be with a 5min interval)
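The fetch-based calculation above does not need a spreadsheet. The following is a rough sketch in Python, assuming the usual text layout of `rrdtool fetch` output (a header line of DS names, then `timestamp: value ...` rows). The helper names, the sample data, the `-31d` start, and the nearest-rank percentile method are all assumptions for illustration.

```python
import math
import subprocess

def fetch_text(rrd_path):
    """Run `rrdtool fetch` at 300s resolution.

    Not called in this sketch: it needs rrdtool installed and a real RRD file.
    """
    cmd = ["rrdtool", "fetch", rrd_path, "AVERAGE", "-r", "300", "-s", "-31d"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def parse_fetch_output(text):
    """Return {ds_name: [values]} from fetch output, dropping unknown (NaN) samples."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    names = lines[0].split()
    series = {name: [] for name in names}
    for line in lines[1:]:
        _timestamp, _, rest = line.partition(":")
        for name, field in zip(names, rest.split()):
            value = float(field)
            if not math.isnan(value):
                series[name].append(value)
    return series

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hand-written sample in the shape of fetch output, for a DS named "a".
SAMPLE = """\
                            a

1700000000: 1.0000000000e+01
1700000300: 1.0000000000e+02
1700000600: nan
1700000900: 2.0000000000e+01
"""

series = parse_fetch_output(SAMPLE)
p95 = percentile(series["a"], 95)
print("samples:", series["a"], "95th percentile:", p95)
```

Unknown samples are skipped here rather than counted as zero; whether that is the right policy depends on how you want gaps in the data to affect the percentile.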

    Your next question will likely be, how do I stop this from happening? The unfortunate answer is that you cannot. RRDTool does not have a Percentile type CF, and so the correct calculations cannot be performed in the RRA (this would be the only real solution).

    The Routers2 frontend for MRTG calculates 95th Percentiles for its graphs by performing a high-resolution fetch to get the raw data, calculating the value internally, and then passing it in as an HRULE when making the graph. In other words, it doesn't use a VDEF at all, because of the problem you are experiencing.
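The same idea can be sketched in Python. This is not Routers2 code, only an illustration of the approach: the function name, file names, DS name, colours, and the `-31d` start are hypothetical, and the percentile value is assumed to have been calculated separately from fetched 300s data.

```python
def graph_args_with_percentile_rule(png_path, rrd_path, ds_name, p95):
    """Build `rrdtool graph` arguments that draw a precomputed percentile as an HRULE.

    The percentile is passed in as a plain number, so no VDEF is involved
    and the graph width no longer influences the value shown in the legend.
    """
    return [
        "graph", png_path,
        "--start", "-31d",
        f"DEF:a={rrd_path}:{ds_name}:AVERAGE",
        "LINE1:a#0000FF:traffic",
        f"HRULE:{p95}#FF0000:95th percentile {p95:.2f}",
    ]

# Hypothetical usage: 100.0 stands in for a value computed from fetched data.
args = graph_args_with_percentile_rule("month.png", "traffic.rrd", "ds0", 100.0)
print(" ".join(["rrdtool"] + args))
```

The argument list can then be handed to rrdtool, for example through `subprocess.run(["rrdtool"] + args)`. Paths or legends containing a colon would need escaping, which this sketch does not handle.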