Search code examples
regexposixamazon-redshift

Trouble Capturing URLs With Campaign Tracking Parameters in Redshift


I am trying to capture URLs that have tracking parameters in a query to a website's homepage. There are cases where the parameter can have a forward slash before the query begins. Here are two examples that should match:

https://test.com/?utm_campaign=email
https://test.com?utm_campaign=email

Here are two examples that should not match:

 https://test.com/blog
 https://test.com/blog?utm_campaign=email

Here is an example query:

SELECT t.url,COUNT(t.id) AS pageviews
FROM db.table AS t
WHERE t.url ~ '^https*:\\/\\/test\\.com\\?'
GROUP BY 1
ORDER BY 2 DESC

Note that Redshift documentation states:

To search for strings that include metacharacters, such as ‘. * | ? ‘, and so on, escape the character using two backslashes (' \\ ')

I have tried both single and double slashes. The single slash returns a lot more than I expect, whereas the double slash does not return any results. I'm more accustomed to writing regex in Javascript, and as such I assume I'm having trouble translating between the two; any help is much appreciated.


Solution

  • The / symbol is not a special regex metacharacter, you should not escape it. Besides, in order to avoid issues with escaping . or ?, you may put them into bracket expressions:

    WHERE t.url ~ '^https?://test[.]com[?]'
    

    It will match:

    • ^ - start of string
    • https?://test[.]com[?] - http://test.com? or https://test.com?.