
MySQL / SQLAlchemy — JOIN on REGEX substring match between columns


I'd like to join (e.g. LEFT JOIN) the following two tables on a match of the main URL substring (ignoring the www. that follows https://)

  1. first just using MySQL
  2. then also implemented in SQLAlchemy

For example:

https://www.example.com/path1/  =matches=  https://example.com/path1/

but (http vs https)

http://www.example.com/path1/  =!does not match!=  https://example.com/path1/

First_Table

id  Page                             Visits
0   https://www.example.com/path1/   164
1   https://www.example.com/path2/   4685
2   https://www.example.com/         12005

Second_Table

id  Page                         Visits
5   https://example.com/path1/   546
8   https://example.com/path2/   4689
9   https://example.com/         8466

A regex that captures the main substring is, for example, (?<=https://)(?:www\.)?(.+)
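To make the intended matching rule concrete, here is a small sketch that applies the regex above with Python's `re` module. The helper name `main_part` is made up for illustration; it only shows which part of each URL the pattern captures, not how MySQL would evaluate it.

```python
import re

# The pattern from the question: text after "https://", skipping an optional "www."
PATTERN = re.compile(r"(?<=https://)(?:www\.)?(.+)")


def main_part(url):
    """Return the captured main part of the URL, or None if it is not https."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


for url in (
    "https://www.example.com/path1/",
    "https://example.com/path1/",
    "http://www.example.com/path1/",
):
    print(url, "->", main_part(url))
```

The first two URLs yield the same captured string, so they would join; the `http://` URL yields no match at all, which is the behaviour asked for.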

How would you approach this? What is the proper MySQL syntax and how can it be implemented with SQLAlchemy?


Solution

  • I'm not a SQL specialist and I'm not sure exactly what you want to do, but I would do the following:

    1. Normalize one table by replacing //www.example.com with //example.com, or the inverse (//example.com with //www.example.com), so that both tables use the same form of page URL for the comparison.

    2. Create a union of this modified table with the other one.

    3. Group by page URL so that we can sum the visits.

    The SQL code:

    SELECT
      Page,
      SUM(Visits) AS `Total_Visits`
    FROM
      (
        SELECT
          t1.Page,
          t1.Visits
        FROM
          First_Table t1
        -- UNION ALL keeps every row; a plain UNION would drop identical rows and undercount
        UNION ALL
          (
            SELECT
              REPLACE(t2.Page, '//example.com', '//www.example.com') AS `Page`,
              t2.Visits
            FROM
              Second_Table t2
          )
      ) Union_Table
    GROUP BY
      Page
    

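    Result (computed from the sample data in the question):

    Page                             Total_Visits
    https://www.example.com/         20471
    https://www.example.com/path1/   710
    https://www.example.com/path2/   9374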

    You could add a sort on the page URL.
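Since the question also asks about SQLAlchemy, the three steps could be sketched roughly as follows. This is only an illustration under assumptions: it assumes SQLAlchemy 1.4 or newer, defines the two tables by hand with the column names from the question, and runs against an in-memory SQLite database so it is self-contained (for MySQL only the engine URL would change). It uses UNION ALL so that identical rows are not collapsed before summing.

```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table,
    create_engine, func, select, union_all,
)

metadata = MetaData()
first_table = Table(
    "First_Table", metadata,
    Column("id", Integer, primary_key=True),
    Column("Page", String(255)),
    Column("Visits", Integer),
)
second_table = Table(
    "Second_Table", metadata,
    Column("id", Integer, primary_key=True),
    Column("Page", String(255)),
    Column("Visits", Integer),
)

# Step 1: rewrite Second_Table pages to the "www." form.
second_normalized = select(
    func.replace(second_table.c.Page, "//example.com", "//www.example.com").label("Page"),
    second_table.c.Visits,
)

# Step 2: stack both tables (UNION ALL keeps duplicate rows).
union_table = union_all(
    select(first_table.c.Page, first_table.c.Visits),
    second_normalized,
).subquery("Union_Table")

# Step 3: sum the visits per page, sorted by page URL.
stmt = (
    select(union_table.c.Page, func.sum(union_table.c.Visits).label("Total_Visits"))
    .group_by(union_table.c.Page)
    .order_by(union_table.c.Page)
)

# In-memory SQLite stands in for MySQL here (an assumption for the demo).
engine = create_engine("sqlite:///:memory:")
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(first_table.insert(), [
        {"id": 0, "Page": "https://www.example.com/path1/", "Visits": 164},
        {"id": 1, "Page": "https://www.example.com/path2/", "Visits": 4685},
        {"id": 2, "Page": "https://www.example.com/", "Visits": 12005},
    ])
    conn.execute(second_table.insert(), [
        {"id": 5, "Page": "https://example.com/path1/", "Visits": 546},
        {"id": 8, "Page": "https://example.com/path2/", "Visits": 4689},
        {"id": 9, "Page": "https://example.com/", "Visits": 8466},
    ])
    totals = [tuple(row) for row in conn.execute(stmt)]

for page, total in totals:
    print(page, total)
```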

    Test and play with it here: http://sqlfiddle.com/#!9/d14e11/32
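If an actual LEFT JOIN is wanted rather than summed totals, one possible approach is to join on a normalized URL, i.e. an ON clause equivalent to REPLACE(t1.Page, 'https://www.', 'https://') = t2.Page. The sketch below expresses that in SQLAlchemy. It is not the regex-based join from the question but a plain string replacement swapped in for it, and it assumes SQLAlchemy 1.4 or newer, hand-written table definitions, and an in-memory SQLite database for the demo. The row with id 3 is a hypothetical addition to show that an http:// page finds no partner.

```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table,
    create_engine, func, select,
)

metadata = MetaData()
first_table = Table(
    "First_Table", metadata,
    Column("id", Integer, primary_key=True),
    Column("Page", String(255)),
    Column("Visits", Integer),
)
second_table = Table(
    "Second_Table", metadata,
    Column("id", Integer, primary_key=True),
    Column("Page", String(255)),
    Column("Visits", Integer),
)

# Drop the "www." that directly follows "https://"; http:// URLs are left untouched.
normalized_page = func.replace(first_table.c.Page, "https://www.", "https://")

stmt = (
    select(
        first_table.c.Page.label("first_page"),
        first_table.c.Visits.label("first_visits"),
        second_table.c.Page.label("second_page"),
        second_table.c.Visits.label("second_visits"),
    )
    .select_from(
        # LEFT OUTER JOIN: every First_Table row is kept, matched or not.
        first_table.outerjoin(second_table, normalized_page == second_table.c.Page)
    )
    .order_by(first_table.c.id)
)

# In-memory SQLite stands in for MySQL here (an assumption for the demo).
engine = create_engine("sqlite:///:memory:")
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(first_table.insert(), [
        {"id": 0, "Page": "https://www.example.com/path1/", "Visits": 164},
        {"id": 1, "Page": "https://www.example.com/path2/", "Visits": 4685},
        {"id": 2, "Page": "https://www.example.com/", "Visits": 12005},
        # Hypothetical extra row, not in the question's data.
        {"id": 3, "Page": "http://www.example.com/path1/", "Visits": 7},
    ])
    conn.execute(second_table.insert(), [
        {"id": 5, "Page": "https://example.com/path1/", "Visits": 546},
        {"id": 8, "Page": "https://example.com/path2/", "Visits": 4689},
        {"id": 9, "Page": "https://example.com/", "Visits": 8466},
    ])
    rows = [tuple(row) for row in conn.execute(stmt)]

for row in rows:
    print(row)
```

A function call in the join condition generally prevents the database from using an index on Page, so for large tables it may be worth storing the normalized URL in its own column instead.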