Search code examples
regexsnowflake-cloud-data-platformdbt

dbt: regular expression to pick numeric value from text


I am doing transformation using dbt (connected to snowflake). I have a column in a table which contains textual and numeric information. I want to extract the numeric values from the text. Following are the example of the text:

Example 1: '55Mbps downstream/11Mbps upstream'
Example 2: '345 Mbps downstream/115 Mbps upstream'
Example 3: '40Mbps Down / 3Mbps up'
Example 4: '56 Mbps Download / 34 Mbps Upload'

I want to write a regular expression in dbt model that picks the first numeric value appears before Mbps followed by some text and then / and store it in new column. From the above examples, numeric values would be 55, 345, 40, and 56.

WITH rates_transformed AS (
    SELECT
        source_table,
TRY_CAST(
            CASE
                -- Extract numeric value directly before "Mbps" when it's followed by text/space and "/"
                WHEN REGEXP_LIKE(PEAK_INFORMATION_RATE, '\\d+\\s*Mbps[^/]+/') THEN
                    TRY_CAST(REGEXP_SUBSTR(PEAK_INFORMATION_RATE, '(\\d+)\\s*Mbps', 1, 1, 'e') AS FLOAT)

                ELSE -1
            END
        ) AS STD_val
FROM {{ ref('stg_model') }}
    )
    Select * from rates_transformed

However, when I ran the code on above-mentioned examples, it returned -1. Looking for guidance where I made the mistake.


Solution

  • Your regex seems unnecessarily complicated. Simply find the first group of digits next to Mbps that occurs before a '/' character. Try this instead:

    WITH rates_transformed AS (
        SELECT
            source_table,
            COALESCE(REGEXP_SUBSTR(PEAK_INFORMATION_RATE, '(\\d+) ?Mbps.*?/', 1, 1, 'e'), -1) AS STD_val
    FROM {{ ref('stg_model') }}
        )
        Select * from rates_transformed
    
    

    NOTE: The 'e' parameter returns the matched group and 'i' does case insensitive matching, which depending on your use case is likely better, but it may not be if you want to exclude any string other than exactly 'Mbps'. Additionally DBT style guide tells us this should be formatted as such:

    WITH staging_model AS ( SELECT * FROM {{ ref('stg_model') }} ),
    
    rates_transformed AS (
        SELECT
            source_table,
            COALESCE(REGEXP_SUBSTR(PEAK_INFORMATION_RATE, '(\\d+) ?Mbps.*?/', 1, 1, 'e'), -1) AS STD_val
        FROM staging_model 
    )
    
    SELECT * FROM rates_transformed