Search code examples
sql-serverregext-sqlsql-server-2019

Extract date with varying length from text string


When my data in field1 is formatted consistently as mm/dd/yy, I can extract the date using this:

substring([field1],patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9]%',[field1]),8)

However, sometimes it comes in with only one digit for the day and/or month (if less than 10).

I thought I would be able to catch that by using patindex('%[0-9]+/[0-9]+/[0-9]+%',[field1]), since + should match one or more occurrences (and also tried with the + inside the brackets [0-9+]) but those don't match the pattern.

Is there another method that works?


Solution

  • This is awful and deserves better care of your data. As Joel suggested, preferably, before it ever gets anywhere near the database. You shouldn't be storing dates as strings, or as parts of strings - if this is important data, it should be stored separately, using the correct data type. And it shouldn't be coming into the system in inconsistent, cowboy formats; maybe sometimes we like two digits and maybe sometimes we don't, and with 2-digit years no less. Data modelers everywhere are crying.

    Anywho, since the response is likely going to be "it's not my system, I didn't design it, and I can't change it" - and since SQL Server doesn't support proper RegEx - here's how I would deal with it.

    Create a table with all of your potential formats, the length they might be, and the style you want to convert (1/101 for m/d/y, 3/103 for d/m/y). If it could be either, stop now and go back to the first paragraph because you're hosed. If you don't know if 7/8/23 is July 8 or August 7, no amount of RegEx is going to save you anyway.

    CREATE TABLE dbo.DateFormats
    (
      pattern  varchar(100),
      len      tinyint,
      style    tinyint,
      ordering tinyint
    );
    
    INSERT dbo.DateFormats VALUES
    ('[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]', 10, 101, 1),
    ('[0-9][0-9]/[0-9]/[0-9][0-9][0-9][0-9]',       9, 101, 2),
    ('[0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]',       9, 101, 3),
    ('[0-9]/[0-9]/[0-9][0-9][0-9][0-9]',            8, 101, 4),
    ('[0-9][0-9]/[0-9][0-9]/[0-9][0-9]',            8, 1,   5),
    ('[0-9][0-9]/[0-9]/[0-9][0-9]',                 7, 1,   6),
    ('[0-9]/[0-9][0-9]/[0-9][0-9]',                 7, 1,   7),
    ('[0-9]/[0-9]/[0-9][0-9]',                      6, 1,   8);
    

    Now, consider that you have a table with a bunch of garbage:

    CREATE TABLE dbo.trash
    (
      column1 varchar(255)
    );
    
    INSERT dbo.trash(column1) VALUES
    ('some garbage here 01/03/2022 eh?'),
    ('wow some garbage here too 1/03/2022|huh?'),
    ('and garbage here also 01/3/22\yeah!'),
    ('and also 1/3/2022 yeah-huh and a second one 1/4/2055!'),
    ('2/2/22===and what if it starts that way?'),
    ('12/3/22');
    

    Then we can just use cross apply to find the date string that matches one of the patterns, we process them in a specific order so 12/12/2022 doesn't match m/dd/yyyy first, for example. And we pad each side of the source column in case the value starts or ends with a date, and we pad the pattern with anything that's not a number so we don't find the wrong thing.

    SELECT t.column1, 
      ParsedDate = TRY_CONVERT(date, SUBSTRING(t.column1, p.pos, p.len), p.style)
    FROM dbo.trash AS t
    CROSS APPLY 
    (
      SELECT TOP (1) pos = PATINDEX('%' + pattern + '%', t.column1), len, style
       FROM dbo.DateFormats 
       WHERE 'x' + t.column1 + 'x' LIKE '%[^0-9]' + pattern + '[^0-9]%'
       ORDER BY ordering
    ) AS p;
    

    Results:

    column1 ParsedDate
    some garbage here 01/03/2022 eh? 2022-01-03
    wow some garbage here too 1/03/2022|huh? 2022-01-03
    and garbage here also 01/3/22\yeah! 2022-01-03
    and also 1/3/2022 yeah-huh and a second one 1/4/2055! 2022-01-03
    2/2/22===and what if it starts that way? 2022-02-02
    12/3/22 2022-12-03

    Caveats:

    • Again, if you have a mix of m/d/y and d/m/y, it's over.
    • This solution will only identify the first date matching one of the specified formats.
    • Change CROSS APPLY to OUTER APPLY if you want to include source rows that don't contain a date in any format.