
Varnish - use the cache when UTM_, gclid and other campaign params are used, otherwise pass if other querystring present


In short, how can the following rule be changed to allow caching if specified querystring parameters are present, but disallow caching if they are mixed with any other undefined parameters?

if (req.url ~ "\?.*$" && req.url !~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=") {
    set req.http.X-Cacheable = "NO:Contains Querystring";
    return (pass);
}

Long explanation: I have a Varnish instance running behind an Apache reverse proxy that terminates SSL, with a WordPress backend on Apache.

After deploying the default config, I quickly learned that every URL with a querystring bypasses the cache, which is all well and good. However, when an AdWords visitor arrives, the URL carries utm_ and other campaign-specific parameters, which bust through the cache with the default config. This is not desired: the pages are still static, so it is better to ignore these parameters and still use the cache. This is what I have implemented, and the rule works great on static pages hit with any combination of the defined utm/gclid/fbclid parameters.

sub vcl_recv {

  if (req.url ~ "\?.*$" && req.url !~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=") {
    set req.http.X-Cacheable = "NO:Contains Querystring";
    return (pass);
  }

  if (req.url~"(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|fb_local|mr:[A-z]+)=") {
    set req.url = regsuball(req.url, "(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|fb_local|mr:[A-z]+)=[%.+-_A-z0-9]+&?", "");
  }
  set req.url = regsub(req.url, "(\?&?)$", "");

}

However there's a problem if there is a mix of defined and undefined params:

/home <-- varnish serves cached page
/home?gclid=x <-- varnish serves same cached page as above, great
/home?a=1 <-- caching disabled here
/home?a=1&gclid=x  <-- varnish redirects to /home?a=1 and serves an uncached page. I want varnish to not redirect here (retain the gclid for client in the url) and serve an uncached page.

I then tried changing the rules to the following:

sub vcl_recv {

  if (req.url ~ "\?.*$" && req.url !~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=") {
    set req.http.X-Cacheable = "NO:Contains Querystring";
    return (pass);
  }

  set req.http.x-cache-url = regsuball(req.url, "(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|fb_local|mr:[A-z]+)=[%.+-_A-z0-9]+&?", "");

}
sub vcl_hash {
    hash_data(req.http.x-cache-url);
    return (lookup);
}

This forces Varnish to never use the specified params in the hash, which avoids the redirect. The undesired side effect is that Varnish now caches any URL containing both defined and undefined parameters, effectively opening a way to poison the cache in the long run:

/home <-- varnish serves a cached page
/home?gclid=x <-- varnish serves same cached page as above, great
/home?a=1 <-- caching disabled here
/home?a=1&gclid=x  <-- varnish serves the cached page /home?a=1 however retains the original url. I want to avoid caching here with any undefined parameters in the querystring.

Has anyone got any ideas how I could define such a rule?


Solution

  • This is the VCL snippet I typically use to strip off tracking query string parameters:

    sub vcl_recv {
        if (req.url ~ "(\?|&)(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=") {
            set req.url = regsuball(req.url, "&(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "");
            set req.url = regsuball(req.url, "\?(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "?");
            set req.url = regsub(req.url, "\?&", "?");
            set req.url = regsub(req.url, "\?$", "");
        }
    }
    

    Here's the varnishlog -g request -i requrl output that proves how this works:

    $ varnishlog -g request -i requrl
    *   << Request  >> 32770
    -   ReqURL         /?gclid=x
    -   ReqURL         /?gclid=x
    -   ReqURL         /?
    -   ReqURL         /?
    -   ReqURL         /
    
    *   << Request  >> 5
    -   ReqURL         /?a=1&gclid=x
    -   ReqURL         /?a=1
    -   ReqURL         /?a=1
    -   ReqURL         /?a=1
    -   ReqURL         /?a=1
    **  << BeReq    >> 6
    

    All the ReqURL log lines illustrate how the URL evolves from its original value into the final value given the 4 changes it goes through once the regex pattern is matched.

    • If the URL is /?gclid=x, the parameter will be stripped off and the URL ends up being /
    • If the URL is /?a=1&gclid=x, the gclid parameter will be stripped off while the a parameter remains untouched.
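
    To make the four rewrite steps easier to follow in the log, each one can be labelled. The following is a minimal sketch, assuming Varnish 4 with the bundled std vmod; the backend address and the "step N" labels are placeholders for illustration and are not part of the original snippet.

```vcl
vcl 4.0;

import std;

# Hypothetical backend so the snippet loads on its own; adjust host and port.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {
    if (req.url ~ "(\?|&)(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=") {
        # Step 1: drop tracking parameters that follow another parameter.
        set req.url = regsuball(req.url, "&(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "");
        std.log("step 1: " + req.url);

        # Step 2: drop a tracking parameter that opens the query string.
        set req.url = regsuball(req.url, "\?(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "?");
        std.log("step 2: " + req.url);

        # Step 3: repair a query string that now starts with "?&".
        set req.url = regsub(req.url, "\?&", "?");
        std.log("step 3: " + req.url);

        # Step 4: remove a trailing "?" when no parameters are left.
        set req.url = regsub(req.url, "\?$", "");
        std.log("step 4: " + req.url);
    }
}
```

    The labels are written as VCL_Log records, so running varnishlog -g request -i ReqURL,VCL_Log should show them alongside the ReqURL lines.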

    Update

    As mentioned by @sash in the comments, if certain query string parameters remain after the tracking parameters have been stripped off, the cache needs to be bypassed.

    Here's the original VCL where an extra if-statement is added to bypass the cache:

    sub vcl_recv {
        if (req.url ~ "(\?|&)(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=") {
            set req.url = regsuball(req.url, "&(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "");
            set req.url = regsuball(req.url, "\?(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "?");
            set req.url = regsub(req.url, "\?&", "?");
            set req.url = regsub(req.url, "\?$", "");
        }
    
        if (req.url ~ "(\?|&)(a|b|c)=") {
            return(pass);
        }
    }
    

    In this example the appearance of the a, b or c querystring parameter causes the cache to be bypassed.

    Update 2

    After further feedback by @sash in the comments, here's a VCL snippet that removes the tracking query string parameters. If any other parameters remain afterwards, the cache is bypassed:

    sub vcl_recv {
        if (req.url ~ "(\?|&)(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=") {
            set req.url = regsuball(req.url, "&(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "");
            set req.url = regsuball(req.url, "\?(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "?");
            set req.url = regsub(req.url, "\?&", "?");
            set req.url = regsub(req.url, "\?$", "");
        }
    
        if (req.url ~ "\?[^&]+=") {
            return(pass);
        }
    }
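
    If the tracking parameters must also stay in the URL that reaches the backend, as the question asks, the stripping can be done on a copy of the URL that is used only for the cache key. The following is a minimal sketch, assuming Varnish 4; the X-Cache-Url header name is borrowed from the question, the backend address is a placeholder, and the approach assumes the backend returns the same page regardless of the tracking parameters.

```vcl
vcl 4.0;

# Hypothetical backend so the snippet loads on its own; adjust host and port.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {
    # Work on a copy so that req.url is left untouched.
    # This also overwrites any X-Cache-Url header sent by the client.
    set req.http.X-Cache-Url = req.url;

    if (req.http.X-Cache-Url ~ "(\?|&)(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=") {
        set req.http.X-Cache-Url = regsuball(req.http.X-Cache-Url, "&(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "");
        set req.http.X-Cache-Url = regsuball(req.http.X-Cache-Url, "\?(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "?");
        set req.http.X-Cache-Url = regsub(req.http.X-Cache-Url, "\?&", "?");
        set req.http.X-Cache-Url = regsub(req.http.X-Cache-Url, "\?$", "");
    }

    # Anything still left in the query string is not a tracking parameter,
    # so bypass the cache; the backend receives the original URL.
    if (req.http.X-Cache-Url ~ "\?[^&]+=") {
        return (pass);
    }
}

sub vcl_hash {
    # Key the cache on the stripped copy instead of req.url.
    hash_data(req.http.X-Cache-Url);
    if (req.http.host) {
        hash_data(req.http.host);
    } else {
        hash_data(server.ip);
    }
    return (lookup);
}

sub vcl_backend_fetch {
    # Keep the helper header out of the backend request.
    unset bereq.http.X-Cache-Url;
}
```

    With this sketch, /home and /home?gclid=x should share one cache object, while /home?a=1 and /home?a=1&gclid=x are passed to the backend with their query strings intact.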