I am trying to capture URLs that have tracking parameters in a query to a website's homepage. There are cases where the parameter can have a forward slash before the query begins. Here are two examples that should match:
https://test.com/?utm_campaign=email
https://test.com?utm_campaign=email
Here are two examples that should not match:
https://test.com/blog
https://test.com/blog?utm_campaign=email
Here is an example query:
SELECT t.url,COUNT(t.id) AS pageviews
FROM db.table AS t
WHERE t.url ~ '^https*:\\/\\/test\\.com\\?'
GROUP BY 1
ORDER BY 2 DESC
Note that Redshift documentation states:
To search for strings that include metacharacters, such as ‘. * | ? ‘, and so on, escape the character using two backslashes (' \\ ')
I have tried both single and double slashes. The single slash returns a lot more than I expect, whereas the double slash does not return any results. I'm more accustomed to writing regex in Javascript, and as such I assume I'm having trouble translating between the two; any help is much appreciated.
The /
symbol is not a special regex metacharacter, you should not escape it. Besides, in order to avoid issues with escaping .
or ?
, you may put them into bracket expressions:
WHERE t.url ~ '^https?://test[.]com[?]'
It will match:
^
- start of stringhttps?://test[.]com[?]
- http://test.com?
or https://test.com?
.