Search code examples
sqloracleselectregexp-substrconnect-by

Oracle: Trying to split a string using REGEXP_SUBSTR, CONNECT BY LEVEL functions and getting duplicates


I have a table that has a column that holds an html string that may or may not be well formed (tried to use xmltable route and it didn't work) which is why I am trying to use the following sql syntax. I am trying to create a query that uses REGEXP_SUBSTR, LEVEL and CONNECT BY LEVEL based on examples I found online but I am not able to get the results I would expect:

I created an example query below:

with qry as (
select 1 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' as html_string from dual
union
select 2 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' as html_string from dual
)
SELECT ID,
REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, LEVEL, NULL, 1) as contents,
REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, LEVEL, NULL, 1) as data_jump
FROM qry
CONNECT BY LEVEL <= REGEXP_COUNT(html_string, '<a.*?>(.*?).</a>')

By running this query, I would expect 7 rows to come back:

ID Contents Data_Jump
1 Client Servic 111
1 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114
2 Customer Servic 112

Instead its coming back with 22 rows.

ID Contents Data_Jump
1 Client Servic 111
1 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114
2 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114
2 Client Servic 111
1 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114
2 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114

I don't know where the problem is that I am getting duplicates. On the actual production table there are a lot of rows so the query just spins and never returns. I suspect its because a lot of duplicates are being generated. Can anyone tell me how to fix the above query so that I can try to transport it onto my actual query?

Thanks,


Solution

  • You need to include the ID in the connect-by clause; but for this to wotk you also need to include a non-deterministic function call; e.g.:

    AND PRIOR id = id
    AND PRIOR dbms_random.value IS NOT NULL
    

    So

    with qry as (
    select 1 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' as html_string from dual
    union
    select 2 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' as html_string from dual
    )
    SELECT ID,
    REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, LEVEL, NULL, 1) as contents,
    REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, LEVEL, NULL, 1) as data_jump
    FROM qry
    CONNECT BY LEVEL <= REGEXP_COUNT(html_string, '<a.*?>(.*?).</a>')
    AND PRIOR id = id
    AND PRIOR dbms_random.value IS NOT NULL
    
    ID CONTENTS DATA_JUMP
    1 Client Servic 111
    1 Customer Servic 112
    1 Tecnical Servic 113
    2 Client Servic 111
    2 Customer Servic 112
    2 Tecnical Servic 113
    2 Other Servic 114

    db<>fiddle

    Or you could use recursive subquery factoring instead of a hierarchical query, which is maybe easier to understand:

    WITH rcte (id, html_string, lvl, contents, data_jump) AS (
      SELECT ID, html_string, 1,
        REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, 1, NULL, 1),
        REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, 1, NULL, 1)
      FROM qry
      UNION ALL
      SELECT ID, html_string, lvl + 1,
        REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, lvl + 1, NULL, 1),
        REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl + 1, NULL, 1)
      FROM rcte
      WHERE REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, lvl + 1, NULL, 1) IS NOT NULL
      OR REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl + 1, NULL, 1) IS NOT NULL
    )
    SELECT id, contents, data_jump
    FROM rcte
    ORDER BY id, lvl
    

    db<>fiddle

    Or even, preserving the last character of the 'contents' by removing the stray . from the regex as @Gary_W pointed out:

    with qry as (
    select 1 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' as html_string from dual
    union
    select 2 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' as html_string from dual
    ),
    rcte (id, html_string, lvl, contents, data_jump) AS (
      SELECT ID, html_string, 1,
        REGEXP_SUBSTR(html_string, '<a.*?>(.*?)</a>', 1, 1, NULL, 1),
        REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, 1, NULL, 1)
      FROM qry
      UNION ALL
      SELECT ID, html_string, lvl + 1,
        REGEXP_SUBSTR(html_string, '<a.*?>(.*?)</a>', 1, lvl + 1, NULL, 1),
        REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl + 1, NULL, 1)
      FROM rcte
      WHERE REGEXP_SUBSTR(html_string, '<a.*?>(.*?)</a>', 1, lvl + 1, NULL, 1) IS NOT NULL
      OR REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl + 1, NULL, 1) IS NOT NULL
    )
    SELECT id, contents, data_jump
    FROM rcte
    ORDER BY id, lvl
    
    ID CONTENTS DATA_JUMP
    1 Client Service 111
    1 Customer Service 112
    1 Tecnical Service 113
    2 Client Service 111
    2 Customer Service 112
    2 Tecnical Service 113
    2 Other Service 114

    db<>fiddle