How To Force Varnish Cache to Keep All Cached Pages Indefinitely?

We currently have a Varnish cache server configured in front of our Magento website. We run a cache warmer daily which pulls the sitemaps from the site and uses them to hit every page of our site in order to have them available in the Varnish cache. This works well for the most part, and ensures pages are warm when users hit them.

The issue we are seeing is that sometimes pages which were previously cached randomly need to be cached again. We've disabled the ability for Magento to purge URLs via the VCL, and increased the TTL to 365 days and still we are having issues with pages being removed from the cache. I've ensured we are not seeing any n_lru_nuked pages via varnishstat, so it is not a memory issue.

I'm not sure what would be causing these pages to be removed from the Varnish cache in a seemingly random way. I've included our current VCL file below. Is there something I'm missing which is causing the pages to lose their cache? If the VCL looks right, is there something else that could be causing this issue outside of the VCL?

# VCL version 5.0 is not supported so it should be 4.0 even though actually used Varnish version is 5
vcl 4.0;

import std;
# The minimal Varnish version is 5.0
# For SSL offloading, pass the following header in your proxy server or load balancer: 'X-Forwarded-Proto: https'

backend default {
    .host = "127.0.0.1";
    .port = "8181";
    .first_byte_timeout = 600s;
    .probe = {
        .url = "/pub/health_check.php";
        .timeout = 2s;
        .interval = 5s;
        .window = 10;
        .threshold = 5;
   }
}

#acl purge {
#    "127.0.0.1";
#}

sub vcl_recv {
    if (req.method == "PURGE") {
        #if (client.ip !~ purge) {
            return (synth(405, "Method not allowed"));
        #}
        # To use the X-Pool header for purging varnish during automated deployments, make sure the X-Pool header
        # has been added to the response in your backend server config. This is used, for example, by the
        # capistrano-magento2 gem for purging old content from varnish during its deploy routine.
        #if (!req.http.X-Magento-Tags-Pattern && !req.http.X-Pool) {
        #    return (synth(400, "X-Magento-Tags-Pattern or X-Pool header required"));
        #}
        #if (req.http.X-Magento-Tags-Pattern) {
        #  ban("obj.http.X-Magento-Tags ~ " + req.http.X-Magento-Tags-Pattern);
        #}
        #if (req.http.X-Pool) {
        #  ban("obj.http.X-Pool ~ " + req.http.X-Pool);
        #}
        #return (synth(200, "Purged"));
    }

    if (req.method != "GET" &&
        req.method != "HEAD" &&
        req.method != "PUT" &&
        req.method != "POST" &&
        req.method != "TRACE" &&
        req.method != "OPTIONS" &&
        req.method != "DELETE") {
          /* Non-RFC2616 or CONNECT which is weird. */
          return (pipe);
    }

    # We only deal with GET and HEAD by default
    if (req.method != "GET" && req.method != "HEAD") {
        return (pass);
    }

    # Bypass shopping cart, checkout and search requests
    if (req.url ~ "/checkout" || req.url ~ "/catalogsearch") {
        return (pass);
    }

    # Bypass health check requests
    if (req.url ~ "/pub/health_check.php") {
        return (pass);
    }

    # Set initial grace period usage status
    set req.http.grace = "none";

    # normalize url in case of leading HTTP scheme and domain
    set req.url = regsub(req.url, "^http[s]?://", "");

    # collect all cookies
    std.collect(req.http.Cookie);

    # Compression filter. See https://www.varnish-cache.org/trac/wiki/FAQ/Compression
    if (req.http.Accept-Encoding) {
        if (req.url ~ "\.(jpg|jpeg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") {
            # No point in compressing these
            unset req.http.Accept-Encoding;
        } elsif (req.http.Accept-Encoding ~ "gzip") {
            set req.http.Accept-Encoding = "gzip";
        } elsif (req.http.Accept-Encoding ~ "deflate" && req.http.user-agent !~ "MSIE") {
            set req.http.Accept-Encoding = "deflate";
        } else {
            # unknown algorithm
            unset req.http.Accept-Encoding;
        }
    }

    # Remove Google gclid parameters to minimize the cache objects
    set req.url = regsuball(req.url,"\?gclid=[^&]+$",""); # strips when QS = "?gclid=AAA"
    set req.url = regsuball(req.url,"\?gclid=[^&]+&","?"); # strips when QS = "?gclid=AAA&foo=bar"
    set req.url = regsuball(req.url,"&gclid=[^&]+",""); # strips when QS = "?foo=bar&gclid=AAA" or QS = "?foo=bar&gclid=AAA&bar=baz"

    # Static files caching
    if (req.url ~ "^/(pub/)?(media|static)/") {
        # Static files should not be cached by default
        #return (pass);

        # But if you use a few locales and don't use CDN you can enable caching static files by commenting previous line (#return (pass);) and uncommenting next 3 lines
        unset req.http.Https;
        unset req.http.X-Forwarded-Proto;
        unset req.http.Cookie;
    }

    return (hash);
}

sub vcl_hash {
#    if (req.http.cookie ~ "X-Magento-Vary=") {
#        hash_data(regsub(req.http.cookie, "^.*?X-Magento-Vary=([^;]+);*.*$", "\1"));
#    }

    # For multi site configurations to not cache each other's content
    if (req.http.host) {
        hash_data(req.http.host);
    } else {
        hash_data(server.ip);
    }

    # To make sure http users don't see ssl warning
    if (req.http.X-Forwarded-Proto) {
        hash_data(req.http.X-Forwarded-Proto);
    }
    
}

sub vcl_backend_response {

    set beresp.grace = 365d;

    if (beresp.http.content-type ~ "text") {
        set beresp.do_esi = true;
    }

    if (bereq.url ~ "\.js$" || beresp.http.content-type ~ "text") {
        set beresp.do_gzip = true;
    }

    if (beresp.http.X-Magento-Debug) {
        set beresp.http.X-Magento-Cache-Control = beresp.http.Cache-Control;
    }

    # cache only successful responses and 404s
    if (beresp.status != 200 && beresp.status != 404) {
        set beresp.ttl = 0s;
        set beresp.uncacheable = true;
        return (deliver);
    } elsif (beresp.http.Cache-Control ~ "private") {
        set beresp.uncacheable = true;
        set beresp.ttl = 365d;
        return (deliver);
    }

    # validate if we need to cache it and prevent from setting cookie
    if (beresp.ttl > 0s && (bereq.method == "GET" || bereq.method == "HEAD")) {
        unset beresp.http.set-cookie;
    }

   # If page is not cacheable then bypass varnish for 2 minutes as Hit-For-Pass
   if (beresp.ttl <= 0s ||
       beresp.http.Surrogate-control ~ "no-store" ||
       (!beresp.http.Surrogate-Control &&
       beresp.http.Cache-Control ~ "no-cache|no-store") ||
       beresp.http.Vary == "*") {
        # Mark as Hit-For-Pass for the next 2 minutes
        set beresp.ttl = 120s;
        set beresp.uncacheable = true;
    }

    return (deliver);
}

sub vcl_deliver {
    #if (resp.http.X-Magento-Debug) {
        if (resp.http.x-varnish ~ " ") {
            set resp.http.X-Magento-Cache-Debug = "HIT";
            set resp.http.Grace = req.http.grace;
        } else {
            set resp.http.X-Magento-Cache-Debug = "MISS";
        }
    #} else {
    #    unset resp.http.Age;
    #}

    # Not letting browser to cache non-static files.
    if (resp.http.Cache-Control !~ "private" && req.url !~ "^/(pub/)?(media|static)/") {
        set resp.http.Pragma = "no-cache";
        set resp.http.Expires = "-1";
        set resp.http.Cache-Control = "no-store, no-cache, must-revalidate, max-age=0";
    }

    unset resp.http.X-Magento-Debug;
    unset resp.http.X-Magento-Tags;
    unset resp.http.X-Powered-By;
    unset resp.http.Server;
    unset resp.http.X-Varnish;
    unset resp.http.Via;
    unset resp.http.Link;
}

sub vcl_hit {
    if (obj.ttl >= 0s) {
        # Hit within TTL period
        return (deliver);
    }
    if (std.healthy(req.backend_hint)) {
        if (obj.ttl + 300s > 0s) {
            # Hit after TTL expiration, but within grace period
            set req.http.grace = "normal (healthy server)";
            return (deliver);
        } else {
            # Hit after TTL and grace expiration
            return (miss);
        }
    } else {
        # server is not healthy, retrieve from cache
        set req.http.grace = "unlimited (unhealthy server)";
        return (deliver);
    }
}

Solution

While you altered some beresp.grace and beresp.ttl settings, you didn't really increase the TTL for cacheable content in your VCL file.

You start with set beresp.grace = 365d; which only increases the level of tolerated staleness, not how long an object stays fresh.

In the branch that checks beresp.http.Cache-Control ~ "private", you do set the TTL to 365 days, but that has no effect on cached pages: the response is marked uncacheable there, so the value only sets the lifetime of the Hit-For-Miss object. You can use the default value of 120 seconds for that.

What you really need to do to fix the issue is add the following line above the line where you define the 365-day grace:

set beresp.ttl = 365d;

By setting this at the beginning, you use this as the default TTL.
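As a minimal sketch of where that line goes, assuming the rest of your vcl_backend_response stays exactly as posted, the start of the subroutine would look roughly like this:

```vcl
sub vcl_backend_response {
    # Default TTL for every response; this replaces the TTL that Varnish
    # derived from the backend's Cache-Control / Expires headers.
    set beresp.ttl = 365d;

    # Existing line: how long stale objects may still be served.
    set beresp.grace = 365d;

    # ... the rest of your existing logic follows unchanged ...
}
```

The branches further down that assign beresp.ttl themselves (0s for non-200/404 responses, 120s for uncacheable pages) still override this default for those responses.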

You can monitor the TTL that was set using varnishlog. Here's the command you would use for that:

sudo varnishlog -g request -b -i berequrl -i TTL

See https://varnish-cache.org/docs/7.4/reference/vsl.html for the meaning of the various fields of the TTL tag.

UPDATE 1: clarifying Hit-for-miss

After having read your comments, it's important to understand the difference between the regular TTL and the Hit-for-miss TTL.

By setting beresp.uncacheable = true;, you're ensuring the response doesn't end up in the cache, probably because it's highly personalized and not cacheable at all. The TTL you set for uncacheable content defines how long that content is going to bypass the waiting list upon lookup.

We're basically caching the decision not to cache, which is very specific to Varnish and massively improves performance when a cache miss occurs. The waiting list is used to queue requests that trigger a cache miss. Instead of sending every miss to the backend, Varnish will merge them into a single backend request and satisfy them in parallel when the response comes in.

Because the content is uncacheable, it will never end up in a situation where multiple users will be served by that response. By bypassing that waiting list, performance improves significantly.
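To illustrate, here is a sketch of the private branch from your VCL using a 120-second Hit-for-miss TTL instead of 365 days. It is only meant to show the pattern, not a drop-in replacement for the whole subroutine:

```vcl
sub vcl_backend_response {
    if (beresp.http.Cache-Control ~ "private") {
        # Do not store the response itself ...
        set beresp.uncacheable = true;
        # ... but remember for 2 minutes that this URL is uncacheable,
        # so requests for it bypass the waiting list.
        set beresp.ttl = 120s;
        return (deliver);
    }
}
```

The TTL here only controls how long the "do not cache" decision is remembered, so a short value is enough.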

On a side note: the standard Magento VCL even has some logic where the TTL of uncacheable content is set to zero. This defeats the entire purpose of bypassing the waiting list. We're trying to get that fixed through a pull request.

UPDATE 2: interpreting the logs

After having run sudo varnishlog -g request -b -i berequrl -i TTL, you sent me some log output in the comments, which looks like this:

-- RFC 86400 10 0 1700578572 1700578572 1700578570 1700664971 86400, 
-- TTL VCL 31536000 10 0 1700578572, 
-- TTL VCL 31536000 31536000 0 1700578572

FYI, this is the spec of the TTL field:

%s %d %d %d %d [ %d %d %u %u ] %s
|  |  |  |  |    |  |  |  |    |
|  |  |  |  |    |  |  |  |    +- "cacheable" or "uncacheable"
|  |  |  |  |    |  |  |  +------ Max-Age from Cache-Control header
|  |  |  |  |    |  |  +--------- Expires header
|  |  |  |  |    |  +------------ Date header
|  |  |  |  |    +--------------- Age (incl Age: header value)
|  |  |  |  +-------------------- Reference time for TTL
|  |  |  +----------------------- Keep
|  |  +-------------------------- Grace
|  +----------------------------- TTL
+-------------------------------- "RFC", "VCL" or "HFP"

If the first field is RFC, this means the TTL was derived from the standard HTTP caching headers sent by the backend. That will probably be the Cache-Control header.

The second field returns the TTL that was set.

The third field displays the grace that was set (by setting beresp.grace in VCL or by using the stale-while-revalidate directive in the Cache-Control header).

The fourth field keeps track of the keep time that was set through beresp.keep (which defaults to zero).

The next couple of fields are timestamps when the object was stored in the cache. But since you seem to be using an older version of Varnish, the spec doesn't match 100%. The cacheable/uncacheable marker is not part of your output.

The second to last field is the expire time and the last field is the TTL value that was captured in the max-age directive of the cache-control header.

What the first line of your logs indicates is that an RFC-based header was set and that the TTL is 86400. This matches the last field, which is the max-age directive. This means a Cache-Control: public, max-age=86400 header was set by the application.

The second line features a VCL override of your TTL to 31536000 seconds. That's the result of set beresp.ttl = 365d;.

The third line features the VCL override of your grace value. The standard value is 10 seconds, but you also turned that into 31536000 seconds by setting set beresp.grace = 365d;