Search code examples
regexurilooker-studiore2

How to extract the year from a URL path using REGEXP_EXTRACT in Google Data Studio?


I'm building out a Google Data Studio dashboard and I need to create a calculated field for the year a post was published. The year is in the URI path, but I'm not sure how to extract it using REGEXP_EXTRACT. I've tried a number of solutions proposed on here but none of them seem to work on Data Studio.

In short, I have a URI like this: /theme/2019/jan/blog-post-2019/

How do I use the REGEXP_EXTRACT function to get the first 2019 after theme/ and before /jan?


Solution

  • Try this:

    REGEXP_EXTRACT(Page, 'theme\/([0-9]{4})\/[a-z]{3}\/') 
    

    where:

    • theme\/ means literally "theme/";
    • ([0-9]{4}) is a capturing group containing 4 characters from 0 to 9 (i.e. four digits);
    • \/[a-z]{3}\/ means a slash, followed by 3 lowercase letters (supposing that you want the regex to match all the months), followed by another slash. If you want something more restrictive, try with \/(?:jan|feb|mar|...)\/ for the last part.

    See demo.