I'd like to join (e.g. LEFT JOIN) the following two tables on a match of the main URL substring (ignoring https://www.):
https://www.example.com/path1/ =matches= https://example.com/path1/
but, because the scheme differs (http vs https):
http://www.example.com/path1/ =!does not match!= https://example.com/path1/
id | Page | Visits |
---|---|---|
0 | https://www.example.com/path1/ | 164 |
1 | https://www.example.com/path2/ | 4685 |
2 | https://www.example.com/ | 12005 |

id | Page | Visits |
---|---|---|
5 | https://example.com/path1/ | 546 |
8 | https://example.com/path2/ | 4689 |
9 | https://example.com/ | 8466 |
A regex that captures the main substring is, for example, (?<=https://)(?:www\.)?(.+)
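To illustrate what that regex captures, here is a minimal Python sketch using the standard re module. The helper name main_part is hypothetical and only used for this illustration; it is not part of MySQL or SQLAlchemy.

```python
import re

# Regex from the question: the lookbehind requires "https://", an optional
# "www." is skipped, and the rest of the URL is captured in group 1.
MAIN_PART = re.compile(r"(?<=https://)(?:www\.)?(.+)")


def main_part(url):
    """Return the part of the URL used as join key, or None if not https."""
    match = MAIN_PART.search(url)
    return match.group(1) if match else None


# With and without "www." give the same key; the http URL gives no key.
print(main_part("https://www.example.com/path1/"))
print(main_part("https://example.com/path1/"))
print(main_part("http://www.example.com/path1/"))
```

Note that with this regex an http URL yields no key at all, so it can never match an https URL, which is the behaviour described above.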
How would you approach this? What is the proper MySQL syntax and how can it be implemented with SQLAlchemy?
I'm not an SQL specialist and I'm not sure exactly what you want to do, but I would do the following:
Derive a normalized copy of one table by replacing //www.example.com with //example.com, or the inverse (//example.com with //www.example.com), so that both tables use the same form of page URL for the comparison.
Create a union of this normalized table with the other one.
Group by page URL so that the visits can be summed.
The SQL code:
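SELECT
    Page,
    SUM(Visits) AS `Total_Visits`
FROM
    (
        -- First table as-is (already in the www form)
        SELECT
            *
        FROM
            First_Table t1
        -- UNION ALL keeps every row; plain UNION would drop identical rows before summing
        UNION ALL
        (
            -- Second table with its URLs rewritten to the www form
            SELECT
                t2.id,
                REPLACE(t2.Page, '//example.com', '//www.example.com') AS `Page`,
                t2.Visits
            FROM
                Second_Table t2
        )
    ) Union_Table
GROUP BY
    Page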
Result:
Page | Total_Visits |
---|---|
https://www.example.com/ | 20471 |
https://www.example.com/path1/ | 710 |
https://www.example.com/path2/ | 9374 |
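
The totals can be checked without a MySQL server. The following is only a sketch: it uses an in-memory SQLite database from Python's standard library as a stand-in for MySQL (an assumption; REPLACE() does the same job in both for this case). It uses UNION ALL so that identical rows are never collapsed before summing, and SQLite does not accept parentheses around the second SELECT, so they are left out.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE First_Table (id INTEGER, Page TEXT, Visits INTEGER);
    CREATE TABLE Second_Table (id INTEGER, Page TEXT, Visits INTEGER);
    INSERT INTO First_Table VALUES
        (0, 'https://www.example.com/path1/', 164),
        (1, 'https://www.example.com/path2/', 4685),
        (2, 'https://www.example.com/', 12005);
    INSERT INTO Second_Table VALUES
        (5, 'https://example.com/path1/', 546),
        (8, 'https://example.com/path2/', 4689),
        (9, 'https://example.com/', 8466);
""")

# Rewrite the second table's URLs to the www form, stack both tables,
# then sum the visits per page.
totals = conn.execute("""
    SELECT Page, SUM(Visits) AS Total_Visits
    FROM (
        SELECT id, Page, Visits FROM First_Table
        UNION ALL
        SELECT id, REPLACE(Page, '//example.com', '//www.example.com'), Visits
        FROM Second_Table
    ) AS Union_Table
    GROUP BY Page
    ORDER BY Page
""").fetchall()

for page, total in totals:
    print(page, total)
```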
You could also sort the result by page URL (ORDER BY Page).
Test and play with it here: http://sqlfiddle.com/#!9/d14e11/32
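
If an actual LEFT JOIN is wanted, as in the question, the same REPLACE idea can go into the ON clause. The sketch below again uses in-memory SQLite from Python as a stand-in for MySQL (an assumption). Stripping 'www.' right after '//' on both sides keeps the scheme in the key, so http and https never match. The row with id 3 is made up for this illustration to show the non-matching case. One caveat: REPLACE would also rewrite a '//www.' that appears later in a URL, for example inside a query string.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE First_Table (id INTEGER, Page TEXT, Visits INTEGER);
    CREATE TABLE Second_Table (id INTEGER, Page TEXT, Visits INTEGER);
    INSERT INTO First_Table VALUES
        (0, 'https://www.example.com/path1/', 164),
        (1, 'https://www.example.com/path2/', 4685),
        (2, 'https://www.example.com/', 12005),
        -- hypothetical row, not in the question: http must not match https
        (3, 'http://www.example.com/path1/', 1);
    INSERT INTO Second_Table VALUES
        (5, 'https://example.com/path1/', 546),
        (8, 'https://example.com/path2/', 4689),
        (9, 'https://example.com/', 8466);
""")

# Join on the URL with 'www.' removed; the scheme stays part of the key.
joined = conn.execute("""
    SELECT t1.id, t1.Visits, t2.id, t2.Visits
    FROM First_Table t1
    LEFT JOIN Second_Table t2
        ON REPLACE(t1.Page, '//www.', '//') = REPLACE(t2.Page, '//www.', '//')
    ORDER BY t1.id
""").fetchall()

for row in joined:
    print(row)
```

Rows of the first table without a counterpart come back with NULL (None in Python) in the second table's columns, which is what distinguishes a LEFT JOIN from an inner join here.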