Search code examples
sqlmysqlgreatest-n-per-group

How do I can pull the latest row having the latest row.date, and keeping all other values?


I'm currently trying to extract latest visited domain information from a table that includes a visits history on my websites and information I want to retreive.

Here is my table structure:

CREATE TABLE `visit_record` (
    `id` INT NOT NULL AUTO_INCREMENT,
    `date` VARCHAR( 19 ) NOT NULL,
    `url` VARCHAR( 1024 ) NOT NULL,
    ... EXTRA COLUMNS/DATA ...
    PRIMARY KEY ( `id` )
);

Here URL stands for multiple pages on same or different domains.

But I want to get all columns for the latest row for each distinct domain, and my URL store a full query, not a domain only (for more flexibility regarding other features).

The problem is, my distinct value to group on is a calculated value. I tried mixing the solution from the post shown in examples I tried, with grouping / left joins / different structure of queries / subqueries, but I cannot manage and as soon as I want to add my other values/columns (for e.g ID of the row, extra data), it show 99% of the time the following error:

SELECT list contains nonaggregated column

(even if I found basic queries on internet looking like my-ones).

As I want to group by domain I'm first extracting a domain name as a column, in a query I'll use later with an alias:

SELECT
    `visit_record`.`id`,
    `visit_record`.`date`,
    SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( `visit_record`.`url`, '/', 3 ), '://', -1 ), '/', 1 ), '?', 1 ) AS `domain`
FROM
    `visit_record`;

Here everything is ok, and I see an additional column having my domain name without protocol and query details.

Now I would like to take the latest (MAX) row: the row having the latest date, and get only 1x per unique domain.

I already found part of the solution I guess, thanks to some topics such as the-ones below: Select info from table where row has max date SQL query to get the latest row

But I seems tricky / requiring modifications to fit my use case.

I'm not providing more examples because I made 15+ tries and I don't know which-one is the best to start from, but I guess you have enough information to help me on this!

Data sample of visit_record table:

id      date                    url                                 extra columns
_________________________________________________________________________________
1       2023-06-21 00:00:00     https://example1.com/en             ...
2       2023-06-20 23:00:00     https://example1.com/uk             ...
3       2023-06-20 14:00:00     https://example2.com/en/about       ...
4       2023-06-21 03:00:00     https://example2.com/fr             ...

My expected output:

id      date                    url                                 domain          extra columns
_________________________________________________________________________________________________
1       2023-06-21 00:00:00     https://example1.com/en             example1.com    ...
4       2023-06-21 03:00:00     https://example2.com/fr             example2.com    ...

Thank you in advance for your help.

Problem Reformulation

The main problem was performing this domain extraction from URL and grouping on it, as it is not a primary key, by keeping all columns of the row which is the "latest" (by date) one of each domain.

SOLUTION

Get for each domain which is not a primary key (extracted from an URL value), the latest-row by date (which is not a primary key too), including all the related row columns.

This query will add a last column "domain" and return the result:

SELECT
    t1.*
FROM
    (
        SELECT
            *,
            SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( `visit_record`.`url`, '/', 3 ), '://', -1 ), '/', 1 ), '?', 1 ) AS `domain`
        FROM 
            `visit_record`
   ) t1
INNER JOIN
    (
        SELECT
            domain,
            max( date ) lastdate
        FROM (
                SELECT
                    *,
                    SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( `visit_record`.`url`, '/', 3 ), '://', -1 ), '/', 1 ), '?', 1 ) AS `domain`
                FROM
                    `visit_record`
            ) t0
        GROUP BY domain
    ) t2 ON
        t1.domain = t2.domain
        AND t1.date = t2.lastdate;

Solution

  • Try this

    select * 
    from(
      select *,row_number()over(partition by domain order by date desc) rn
      from (
        SELECT *,
           SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( 
               SUBSTRING_INDEX( `visit_record`.`url`, '/', 3 )
                 , '://', -1 ), '/', 1 ), '?', 1 ) AS `domain`
        FROM  `visit_record`
       )t1
      )t2
    where rn=1;
    

    Result

    id date url EXTRA_COLUMNS domain rn
    1 2023-06-21 00:00:00 https://example1.com/en ... example1.com 1
    4 2023-06-21 03:00:00 https://example2.com/fr ... example2.com 1

    Updated:
    if row_number() not available in your version of MySQL, use this example:

    select * 
    from (
        SELECT *,
           SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( `visit_record`.`url`, '/', 3 ), '://', -1 ), '/', 1 ), '?', 1 ) AS `domain`
        FROM  `visit_record`
       )t1 
    inner join (
      select domain,max(date) lastdate
      from (
        SELECT *,
           SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( `visit_record`.`url`, '/', 3 ), '://', -1 ), '/', 1 ), '?', 1 ) AS `domain`
        FROM  `visit_record`
       ) t0
      group by domain
      )t2 on t1.domain=t2.domain and t1.date=t2.lastdate
    ;
    

    Example here