Search code examples
sqlregexgoogle-bigqueryre2

regex to extract sentence to the left of a word but not include words with dashes as whole words


I am trying to extract everything to the left of pin_delim. This works fine for id 1,2,3,6 but fails for 5. id 4 is expected to return no results. I can figure out how to stop pin_delim spliting on a dash -?

Any ideas on how to fix this and why is - being treated like a space?

I always want the last pin_delim string to the left.

The university of de pere,delaware,de,usa

RETURN 'The university of de pere,delaware,'
NOT 'The university of '

saint-vincent-de-paul,delaware,de,usa

RETURN 'saint-vincent-de-paul,delaware,de'
NOT 'saint-vincent-'

I have tried the following

 WITH t2 AS (
    WITH t1 AS (
    SELECT 1 id,'middleDEword French,de, Polynesia.' pin_senetence,'de' pin_delimiter,'middleDEword French,' expected
    UNION ALL
    SELECT 2 id, 'Saint-Vincent-de-Paul,de,usa','de','Saint-Vincent-de-Paul,'
    UNION ALL
    SELECT 3 id,'HopiDEtal-de Saint Vincent de Paul,de,usa','de','HopiDEtal-de Saint Vincent de Paul,'
    UNION ALL
    SELECT 4 id,'middleDEword French, Polynesia.' pin_snetence,'de','middleDEword French, Polynesia.'
    UNION ALL
    SELECT 5 id,'Saint-Vincent-de-Paul,usa','de','Saint-Vincent-de-Paul,usa'
    UNION ALL
    SELECT 6 id,'HopiDEtal-de Saint DE Vincent de Paul,usa','de','HopiDEtal-de Saint DE Vincent ')
SELECT id,
    pin_senetence,
    pin_delimiter,
    expected,
    LENGTH(pin_delimiter) substr_sec_param,
    LENGTH(REGEXP_EXTRACT(pin_senetence, CONCAT( r'(?i)\b',  r'^(.*)', pin_delimiter,  r'\b.*'))) AS pos
    FROM t1
    WHERE 1 = 1
    AND REGEXP_CONTAINS(pin_senetence, CONCAT( r'(?i)\b',  r'^(.*)', pin_delimiter,  r'\b.*')))
    SELECT id,
    pin_senetence,
    SUBSTR(pin_senetence, pos + 1, substr_sec_param)          substr_val,
    SUBSTR(pin_senetence, 1, pos) actual,
    expected,
    IF(SUBSTR(pin_senetence, 1, pos) = expected, TRUE, FALSE) is_correct,
    pos,
    substr_sec_param
    FROM t2;


|id |pin_senetence                            |substr_val|actual                             |expected                           |is_correct|pos|substr_sec_param|
|---|-----------------------------------------|----------|-----------------------------------|-----------------------------------|----------|---|----------------|
|1  |middleDEword French,de, Polynesia.       |de        |middleDEword French,               |middleDEword French,               |TRUE      |20 |2               |
|2  |Saint-Vincent-de-Paul,de,usa             |de        |Saint-Vincent-de-Paul,             |Saint-Vincent-de-Paul,             |TRUE      |22 |2               |
|3  |HopiDEtal-de Saint Vincent de Paul,de,usa|de        |HopiDEtal-de Saint Vincent de Paul,|HopiDEtal-de Saint Vincent de Paul,|TRUE      |35 |2               |
|5  |Saint-Vincent-de-Paul,usa                |de        |Saint-Vincent-                     |Saint-Vincent-de-Paul,usa          |FALSE     |14 |2               |
|6  |HopiDEtal-de Saint DE Vincent de Paul,usa|de        |HopiDEtal-de Saint DE Vincent      |HopiDEtal-de Saint DE Vincent      |TRUE      |30 |2               |


Solution

  • I suggest to reverse the both strings, the text and the token pin_delimiter. In the regular expression we can search for the first occurance of the token "ed" (reversed "de") and take everyting after this. Reversing this result is the beginning of the string until the last occurance of the token.

    For excluding - as token separator, please exclude these in [^\w\-].

        WITH t1 AS (
        SELECT 1 id,'middleDEword French,de, Polynesia.' pin_senetence,'de' pin_delimiter,'middleDEword French,' expected
        UNION ALL
        SELECT 2 id, 'Saint-Vincent-de-Paul,de,usa','de','Saint-Vincent-de-Paul,'
        UNION ALL
        SELECT 3 id,'HopiDEtal-de Saint Vincent de Paul,de,usa','de','HopiDEtal-de Saint Vincent de Paul,'
        UNION ALL
        SELECT 4 id,'middleDEword French, Polynesia.' pin_snetence,'de','middleDEword French, Polynesia.'
        UNION ALL
        SELECT 5 id,'Saint-Vincent-de-Paul,usa','de','Saint-Vincent-de-Paul,usa'
        UNION ALL
        SELECT 6 id,'HopiDEtal-de Saint DE Vincent de Paul,usa','de','HopiDEtal-de Saint DE Vincent ')
    SELECT *,
    #substring(pin_senetence,1,length(pin_senetence)-length(text_ending)+2) as extracted,
    substring(text_max,1,length(text_max)-length(pin_delimiter)-1) as text_extracted,
    substring(text_max,1,length(text_max)-length(pin_delimiter)-1)=expected
    from(
    Select *,
    #REGEXP_EXTRACT(concat(pin_senetence," "),CONCAT(r"([^\w\-]",pin_delimiter,r"[^\w\-].*)")) as text_ending,
    reverse(REGEXP_EXTRACT(reverse(concat(pin_senetence," ")),CONCAT(r"(?i)([^\w\-]",reverse(pin_delimiter),r"[^\w\-].*)"))) as text_max
    from t1
    )