Search code examples
sqlamazon-web-servicescsvamazon-redshiftdelimiter

How to split data from two columns using two different delimiters in Redshift?


I have a CTE that has data like this. It follows two formats pretty much where counts and process_ids will have these two types of data.

client_id      day              counts      process_ids
--------------------------------------------------------------------------------------------
abc1          Feb-01-2021        3        C1,C2 | C3,C4,C5 | C6,C7
abc2          Feb-05-2021       2, 3      C10,C11,C12 | C13,C14 # C15,C16 | C17,C18

Now I want to get this below output from the above CTE after splitting it out on counts and process_ids -

client_id      day              counts      process_ids
--------------------------------------------------------
abc1           Feb-01-2021        3           C1
abc1           Feb-01-2021        3           C2
abc1           Feb-01-2021        3           C3
abc1           Feb-01-2021        3           C4
abc1           Feb-01-2021        3           C5
abc1           Feb-01-2021        3           C6
abc1           Feb-01-2021        3           C7
abc2           Feb-05-2021        2           C10
abc2           Feb-05-2021        2           C11
abc2           Feb-05-2021        2           C12
abc2           Feb-05-2021        2           C13
abc2           Feb-05-2021        2           C14
abc2           Feb-05-2021        3           C15
abc2           Feb-05-2021        3           C16
abc2           Feb-05-2021        3           C17
abc2           Feb-05-2021        3           C18

Basically, the idea is to split counts and process_ids basis on the below two use cases if they follow any of those formats.

UseCase 1

If counts column only has single-digit and process_ids column has | delimiter.

enter image description here

UseCase 2

If counts column only has two-digit separated by a , delimiter and process_ids column has # delimiter along with pipe.

enter image description here

I am working with Amazon Redshift here and I am confused about how can I split them out as needed.

Is this possible to do by any chance?


Solution

  • This might look a bit hairy at first sight but has been built up from solid techniques and gives the desired result...

    SQL

    WITH seq_0_9 AS (
      SELECT 0 AS d
      UNION ALL SELECT 1 AS d
      UNION ALL SELECT 2 AS d
      UNION ALL SELECT 3 AS d
      UNION ALL SELECT 4 AS d
      UNION ALL SELECT 5 AS d
      UNION ALL SELECT 6 AS d
      UNION ALL SELECT 7 AS d
      UNION ALL SELECT 8 AS d
      UNION ALL SELECT 9 AS d
    ),
    numbers AS (
      SELECT a.d + b.d * 10 + c.d * 100 + 1 AS n
      FROM seq_0_9 a, seq_0_9 b, seq_0_9 c
    ),
    processed AS
      (SELECT client_id,
              day,
              REPLACE(counts, ' ', '') AS counts,
              REPLACE(REPLACE(process_ids, ' ', ''), '|', ',') AS process_ids
       FROM tbl),
    split_pids AS
      (SELECT
         client_id, 
         day,
         counts,
         split_part(process_ids, '#', n) AS process_ids,
         n AS n1
       FROM processed
       CROSS JOIN numbers
       WHERE 
         split_part(process_ids, '#', n) IS NOT NULL
         AND split_part(process_ids, '#', n) != ''),
    split_counts AS
      (SELECT
         client_id, 
         day,
         split_part(counts, ',', n) AS counts,
         process_ids,
         n1,
         n AS n2
       FROM split_pids
       CROSS JOIN numbers
       WHERE
         split_part(counts, ',', n) IS NOT NULL
         and split_part(counts, ',', n) != ''),
    matched_up AS
      (SELECT * FROM split_counts WHERE n1 = n2)
    SELECT
      client_id, 
      day,
      counts,
      split_part(process_ids, ',', n) AS process_ids
    FROM
      matched_up
    CROSS JOIN
      numbers
    WHERE
      split_part(process_ids, ',', n) IS NOT NULL
      AND split_part(process_ids, ',', n) != '';
    

    Demo

    Online rextester demo (using PostgreSQL but should be compatible with Redshift): https://rextester.com/FNA16497

    Brief Explanation

    This technique is used to generate a numbers table (from 1 to 1000 inclusive). This technique is then used multiple times with multiple Common Table Expressions to achieve it in a single SQL statement.