Search code examples
sqldatetimereplacesplitsnowflake-cloud-data-platform

How to extract convert date in written text to numeric


I've a problem where I from a product description need to convert the months of the description to a billing period. Example description: Transaction packages 2500 transactions / year included (JUNE-MAY)

From this I would like to extract START_BILLING = 2024-06-01 and END_BILLING 2025-05-01, based on the fact that we start in June and end in may. I can find the year of this from other items in the database but I don't know of any nice functions to do this easily.

Any suggestions? I work in snowflake

Been trying around with regexs and different string manipulators but haven't found a easy clean pick yet


Solution

  • So with the input data as you describe, you can ether REGEXP the two parts:

    with data(description) as (
        select * from values
        ('Transaction packages 2500 transactions / year included (JUNE-MAY)')
    )
    select 
        description,
        regexp_substr(description,'\\(([A-Z]+)-([A-Z]+)\\)',1,1,'ce',1) as start_s,
        regexp_substr(description,'\\(([A-Z]+)-([A-Z]+)\\)',1,1,'ce',2) as end_s
    from data;
    

    enter image description here

    OR grap the whole bundle and string split on the -

    Then decode the dates to dates. I am a little concerned your end month is the first day of the month not the last, but if we ignore that:

    with data(description, year_from_elsewhere) as (
        select * from values
        ('Transaction packages 2500 transactions / year included (JUNE-MAY)', 2024)
    )
    select 
        description,
        regexp_substr(description,'\\(([A-Z]+)-([A-Z]+)\\)',1,1,'ce',1) as start_s,
        regexp_substr(description,'\\(([A-Z]+)-([A-Z]+)\\)',1,1,'ce',2) as end_s,
        decode(start_s,
            'JANUARY', 1, 
            'MAY', 5, 
            'JUNE', 6 
            /* complete the 12 */
        ) as start_m,
        decode(end_s,
            'JANUARY', 1, 
            'MAY', 5, 
            'JUNE', 6 
            /* complete the 12 */
        ) as end_m,
        date_from_parts(year_from_elsewhere, start_m, 1) as start_date,
        date_from_parts(year_from_elsewhere+1, end_m, 1) as end_date
        
    from data;
    

    gives:

    enter image description here

    This has a hack assuming end is in the next year, which is incorrect for JAN-DEC

    another alternative if END is always 11 months after the start:

    dateadd('month', 11, start_date) as end_alternate
    

    OR if you want correct

        date_from_parts(year_from_elsewhere+iff(end_m < start_m, 1, 0), end_m, 1) as correct_end_date,
    
    with data(description, year_from_elsewhere) as (
        select * from values
        ('Transaction packages 2500 transactions / year included (JUNE-MAY)', 2024),
        ('blar blar (JANUARY-DEC)', 2024),
        ('more blar (JUNE-SEP)', 2024)
    )
    select 
        description,
        regexp_substr(description,'\\(([A-Z]+)-([A-Z]+)\\)',1,1,'ce',1) as start_s,
        regexp_substr(description,'\\(([A-Z]+)-([A-Z]+)\\)',1,1,'ce',2) as end_s,
        decode(start_s,
            'JANUARY', 1, 
            'MAY', 5, 
            'JUNE', 6,
            'SEP', 9,
            'DEC', 12
            /* complete the 12 */
        ) as start_m,
        decode(end_s,
            'JANUARY', 1, 
            'MAY', 5, 
            'JUNE', 6,
            'SEP', 9,
            'DEC', 12
            /* complete the 12 */
        ) as end_m,
        date_from_parts(year_from_elsewhere, start_m, 1) as start_date,
        --date_from_parts(year_from_elsewhere+1, end_m, 1) as bad_end_date,
        --dateadd('month', 11, start_date) as end_alternate,
        date_from_parts(year_from_elsewhere+iff(end_m < start_m, 1, 0), end_m, 1) as correct_end_date,
    from data;
    

    enter image description here