Search code examples
mysqlregexp-substr

Use of REGEXP_SUBSTR to get date values from string


I'm looking for the REGEXP_SUBSTR code that gets dates like format '06-11-2014 - 05-12-2014' or format '01/11/2019 - 30/11/2019' from a string. The first date being the startdate and the second date being the enddate. It would be extremely helpful to understand how the REGEXP_SUBSTR works in this case and also why. I want to get the string with the two dates, but then I want both dates to be in their own column.

A record look likes this:

Medium - nl (06-11-2014 - 05-12-2014) ruimte: Standaard (5.000 MB).

Although text can be shorter or longer the two dates between brackets are always there.

The code below gets the first one, but only if it's with '-'. I want both '-' and '/' variants displayed.

REGEXP_SUBSTR(description, '[0-9][0-9][-[0-9][0-9]-[0-9][0-9][0-9][0-9]')

Thanks a lot for any and all help.


Solution

  • Since you are using MySQL 8+, it means you also have access to the REGEXP_REPLACE function, which is suitable for isolating the portion of the string which contains the two dates. In the CTE below, I isolate the date string, then in a subquery on that CTE, I fish out the two dates in separate columns using SUBSTRING_INDEX.

    WITH cte AS (
        SELECT
            text,
            REGEXP_REPLACE(text, '^.*\(([0-9]{2}-[0-9]{2}-[0-9]{4} - [0-9]{2}-[0-9]{2}-[0-9]{4})\).*$', '$1') AS dates
        FROM yourTable
    )
    
    SELECT
        text,
        SUBSTRING_INDEX(dates, ' - ', 1) AS first_date,
        SUBSTRING_INDEX(dates, ' - ', -1) AS second_date
    FROM cte;
    

    Demo

    Here is an explanation of the regex pattern used:

    ^                                   from the start of the string
        .*                              match any content, until hitting
        \(                              '(' which is followed by
        (                               (capture what follows)
            [0-9]{2}-[0-9]{2}-[0-9]{4}  a single date
             -                          -
            [0-9]{2}-[0-9]{2}-[0-9]{4}  another single date
        )                               (stop capture)
        \)                              ')'
        .*                              match the remainder of the content
    $                                   end of the string
    

    Note that we include a pattern which matches the entire input, which is a requirement since we want to use a capture group. Also, note that REGEXP_SUBSTR might have been viable here, but it could run the risk that you get false positives, in the event that a date could appear elsewhere besides the terms in parentheses.