I have a table containing a column with strings. I want to extract all pieces of text in each string that come immediately after a certain substring. For this minimum reproducible example, let's assume this substring is abc
. So I want all subsequent terms after abc
.
I'm able to achieve this in cases where there is only 1 abc
per row, but my logic fails when there are multiple abc
s. I'm also getting the number of substring occurrences, but am having trouble relating that to retrieving all of those occurrences.
My approach/attempt:
I created a temp table that contains the # of successful regex matches in my main string:
CREATE TEMP TABLE match_count AS (
SELECT DISTINCT id, main_txt, regexp_count(main_txt, 'abc (\\S+)', 1) AS cnt
FROM my_data_source
WHERE regexp_count(main_txt, 'abc (\\S+)', 1) > 0);
My output:
id main_txt cnt
1 wpfwe abc weiofnew abc wieone 2
2 abc weoin 1
3 abc weoifn abc we abc w 3
To get my final output, I have a query like:
SELECT id, main_txt, regexp_substr(main_txt, 'abc (\\S+)', 1, cnt, 'e') AS output
FROM match_count;
My actual final output:
id main_txt output
1 wpfwe abc weiofnew abc wieone wieone
2 abc weoin weoin
3 abc weoifn abc we abc w w
My expected final output:
id main_txt output
1 wpfwe abc weiofnew abc wieone weiofnew
1 wpfwe abc weiofnew abc wieone wieone
2 abc weoin weoin
3 abc weoifn abc we abc w weoifn
3 abc weoifn abc we abc w we
3 abc weoifn abc we abc w w
So my code only gets the final match (where the occurrence # = cnt
). How can I modify it to include every match?
The solutions below do not handle the case where main_text
has consecutive occurrences of abc
consistently.
ex.
wpfwe abc abc abc weiofnew abc wieone
CREATE TABLE test_hal_unnest (id int, main_text varchar (500));
INSERT INTO test_hal_unnest VALUES
(1, 'wpfwe abc weiofnew abc wieone'),
(2, 'abc weoin'),
(3, 'abc weoifn abc we abc w');
assuming you are searching for all words that comes after the word abc
in a string, you don't necessarily have to use regex. regex support in redshift is unfortunately not as full featured as postgres or some other databases. for instance, you can't extract all substrings that match a regex pattern to an array, or split a string to an array based on a regex pattern.
steps:
' '
LAG
, ordered by the word indexabc
the extra columns idx
& prev_word
are left in the final output to illustrate how the problem is solved. they may be dropped from the final query without issue
WITH text_split AS (
SELECT Id
, main_text
, SPLIT_TO_ARRAY(main_text, ' ') text_arr
FROM test_hal_unnest
)
, text_unnested AS (
SELECT ts.id
, ts.main_text
, ts.text_arr
, CAST(ta as VARCHAR) text_word -- converts super >> text
, idx -- this is the word index
FROM text_split ts
JOIN ts.text_arr ta AT idx
ON TRUE
-- ^^ array unnesting happens via joins
)
, with_prevword AS (
SELECT id
, main_text
, idx
, text_word
, LAG(text_word) over (PARTITION BY id ORDER BY idx) prev_word
FROM text_unnested
ORDER BY id, idx
)
SELECT *
FROM with_prevword
WHERE prev_word = 'abc';
output:
id | main_text | idx | text_word | prev_word
----+-------------------------------+-----+-----------+-----------
1 | wpfwe abc weiofnew abc wieone | 2 | weiofnew | abc
1 | wpfwe abc weiofnew abc wieone | 4 | wieone | abc
2 | abc weoin | 1 | weoin | abc
3 | abc weoifn abc we abc w | 1 | weoifn | abc
3 | abc weoifn abc we abc w | 3 | we | abc
3 | abc weoifn abc we abc w | 5 | w | abc
(6 rows)
quoting redshift documentation on this topic, since its kind of hidden
Amazon Redshift also supports an array index when iterating over the array using the AT keyword. The clause x AS y AT z iterates over array x and generates the field z, which is the array index.
abc
This problem would be more easily solved with regular expression functionality available in redsfhit if instead of
1, wpfwe abc weiofnew abc wieone
the source data was already split up into multiple rows on abc
1, wpfwe
1, abc weiofnew
1, abc wieone
This solution first expands the source data by splitting on abc. however since split_to_array does not accepts are regular expression pattern, we first inject a delimiter ;
before abc
, and then split on ;
.
Any delimiter will work, as long as it is guaranteed not to be present in column main_text
WITH text_array AS (
SELECT
id
, main_text
, SPLIT_TO_ARRAY(REGEXP_REPLACE(main_text, 'abc ', ';abc '), ';') array
FROM test_hal_unnest
)
SELECT
ta.id
, ta.main_text
, REGEXP_SUBSTR(CAST(st AS VARCHAR), 'abc (\\S+)', 1, 1, 'e') output
FROM text_array ta
JOIN ta.array st ON TRUE
WHERE st LIKE 'abc%';