Search code examples
sqlregexsnowflake-cloud-data-platformregexp-replace

REGEXP_REPLACE Strings Starting and Ending with Specific Substrings in Snowflake


I am trying to create a column in a view in Snowflake that replaces any string between strings that I care about with nothing.

This is essentially for the purpose of stripping html formatting out of text. As an example:

<ul>
<li>Text I care about 1
<li>Text I care about 2</li>
<li>Text I care about 3</li>
</ul>

Would should end up like this:


Text I care about 1
Text I care about 2
Text I care about 3

Based on the patterns I am seeing, I think that if I can eliminate any string starting with &lt, and ending with >, I should be able to achieve the result I am looking for.

In testing on different sites it seems like expression REGEXP_REPLACE(originaltext, '&lt.+?>','') should, work, but when attempting in Snowflake it seems to be cutting off the last 'Text I care about' in some cases, and in other cases just is not showing any results at all. I am not sure if there is a syntax difference or something else off in the version of regex snowflake is using, but any advice would be appreciated.


Solution

  • Your regular expression works, but it requires lookarounds.

    set sample1 = '<ul>';
    set sample2 = '<li>Text I care about 1';
    set sample3 = '<li>Text I care about 2</li>';
    set sample4 = '<li>Text I care about 3</li>';
    set sample5 = '</ul>';
    
    select regexp_replace2($SAMPLE1,'&lt.+?>','');  
    select regexp_replace2($SAMPLE2,'&lt.+?>','');
    select regexp_replace2($SAMPLE3,'&lt.+?>','');
    select regexp_replace2($SAMPLE4,'&lt.+?>','');
    select regexp_replace2($SAMPLE5,'&lt.+?>','');
    

    I wrote a UDF library that supports regular expression lookarounds. It attempts to approximate the built-in Snowflake regular expression functions while supporting lookarounds. The names of the UDFs are the same as the built-in regular expression functions with the suffix "2" as shown in the SQL sample.

    https://github.com/GregPavlik/SnowflakeUDFs/tree/main/RegularExpressions