Search code examples
sqlregexstringamazon-redshiftregexp-substr

How to use regexp_count with regexp_substr to output multiple matches per string in SQL (Redshift)?


I have a table containing a column with strings. I want to extract all pieces of text in each string that come immediately after a certain substring. For this minimum reproducible example, let's assume this substring is abc. So I want all subsequent terms after abc.

I'm able to achieve this in cases where there is only 1 abc per row, but my logic fails when there are multiple abcs. I'm also getting the number of substring occurrences, but am having trouble relating that to retrieving all of those occurrences.


My approach/attempt:

I created a temp table that contains the # of successful regex matches in my main string:

CREATE TEMP TABLE match_count AS (
SELECT DISTINCT id, main_txt, regexp_count(main_txt, 'abc (\\S+)', 1) AS cnt
FROM my_data_source
WHERE regexp_count(main_txt, 'abc (\\S+)', 1) > 0);

My output:

id   main_txt                         cnt
1    wpfwe abc weiofnew abc wieone    2
2    abc weoin                        1
3    abc weoifn abc we abc w          3

To get my final output, I have a query like:

SELECT id, main_txt, regexp_substr(main_txt, 'abc (\\S+)', 1, cnt, 'e') AS output
FROM match_count;

My actual final output:

id   main_txt                         output
1    wpfwe abc weiofnew abc wieone    wieone
2    abc weoin                        weoin
3    abc weoifn abc we abc w          w

My expected final output:

id   main_txt                         output
1    wpfwe abc weiofnew abc wieone    weiofnew
1    wpfwe abc weiofnew abc wieone    wieone
2    abc weoin                        weoin
3    abc weoifn abc we abc w          weoifn
3    abc weoifn abc we abc w          we
3    abc weoifn abc we abc w          w

So my code only gets the final match (where the occurrence # = cnt). How can I modify it to include every match?


Solution

  • The solutions below do not handle the case where main_text has consecutive occurrences of abc consistently.

    ex.

    wpfwe abc abc abc weiofnew abc wieone

    set up

    CREATE TABLE test_hal_unnest (id int, main_text varchar (500));
    INSERT INTO test_hal_unnest VALUES 
    (1, 'wpfwe abc weiofnew abc wieone'),
    (2, 'abc weoin'),
    (3, 'abc weoifn abc we abc w');
    

    Possible solution by splitting the string into words

    assuming you are searching for all words that comes after the word abc in a string, you don't necessarily have to use regex. regex support in redshift is unfortunately not as full featured as postgres or some other databases. for instance, you can't extract all substrings that match a regex pattern to an array, or split a string to an array based on a regex pattern.

    steps:

    1. split text to array with delimiter ' '
    2. unnest array with ordinality
    3. look up the previous array element using LAG, ordered by the word index
    4. filter rows where the previous word is abc

    the extra columns idx & prev_word are left in the final output to illustrate how the problem is solved. they may be dropped from the final query without issue

    WITH text_split AS (
    SELECT Id
    , main_text
    , SPLIT_TO_ARRAY(main_text, ' ') text_arr
    FROM test_hal_unnest
    )
    , text_unnested AS (
    SELECT ts.id
    , ts.main_text
    , ts.text_arr
    , CAST(ta as VARCHAR) text_word -- converts super >> text
    , idx -- this is the word index
    FROM text_split ts
    JOIN ts.text_arr ta AT idx 
      ON TRUE
    -- ^^ array unnesting happens via joins
    
    )
    , with_prevword AS (
    SELECT id
    , main_text
    , idx
    , text_word
    , LAG(text_word) over (PARTITION BY id ORDER BY idx) prev_word
    FROM text_unnested
    ORDER BY id, idx
    )
    SELECT *
    FROM with_prevword
    WHERE prev_word = 'abc';
    

    output:

     id |           main_text           | idx | text_word | prev_word
    ----+-------------------------------+-----+-----------+-----------
      1 | wpfwe abc weiofnew abc wieone |   2 | weiofnew  | abc
      1 | wpfwe abc weiofnew abc wieone |   4 | wieone    | abc
      2 | abc weoin                     |   1 | weoin     | abc
      3 | abc weoifn abc we abc w       |   1 | weoifn    | abc
      3 | abc weoifn abc we abc w       |   3 | we        | abc
      3 | abc weoifn abc we abc w       |   5 | w         | abc
    (6 rows)
    

    note on unnest array with ordinality

    quoting redshift documentation on this topic, since its kind of hidden

    Amazon Redshift also supports an array index when iterating over the array using the AT keyword. The clause x AS y AT z iterates over array x and generates the field z, which is the array index.

    alternative shorter solution by splitting on abc

    This problem would be more easily solved with regular expression functionality available in redsfhit if instead of

    1, wpfwe abc weiofnew abc wieone
    

    the source data was already split up into multiple rows on abc

    1, wpfwe
    1, abc weiofnew
    1, abc wieone
    

    This solution first expands the source data by splitting on abc. however since split_to_array does not accepts are regular expression pattern, we first inject a delimiter ; before abc, and then split on ;.

    Any delimiter will work, as long as it is guaranteed not to be present in column main_text

    WITH text_array AS (
    SELECT
      id
    , main_text
    , SPLIT_TO_ARRAY(REGEXP_REPLACE(main_text, 'abc ', ';abc '), ';') array
    FROM test_hal_unnest
    )
    SELECT
      ta.id
    , ta.main_text
    , REGEXP_SUBSTR(CAST(st AS VARCHAR), 'abc (\\S+)', 1, 1, 'e') output
    FROM text_array ta
    JOIN ta.array st ON TRUE
    WHERE st LIKE 'abc%';