Search code examples
sqloracle-databasegroup-byclob

GROUP BY invloving a CLOB data


There is join between three tables, test_3, test_2 and test_1.

test_1 and test_3 are main tables and have no common columns. there are joined by table test_2. test_1 has sr_id, last_updated_date,
test_2 has sr_id and sm_id and test_3 has sm_id,sql_statement. test_3 has clob data which causing all the trouble.

I have to find the latest sr_id associated with sm_id. My idea was to use an aggregate function max(last_updated_date) and group it by. And it doesnt happen for many reasons.

  1. It contains CLOB data the column is sql_statement.

  2. I have used a join which I am not familiar with.

Any ideas would be helpful.

WITH xx as (
    (select  ANSWER ,sr_id AS ID from test 
    WHERE Q_ID in (SELECT Q_ID FROM test_2 WHERE field_id='LM_LRE_Q6')
    ) 
)
-- end of source data


SELECT t.ID, t1.n, t1.SM_ID,seg_dtls.SEGMENTation_NAME ,to_char(mst.LAST_UPDATED_DATE,'dd-mon-yyyy hh24:mi:ss'),seg_dtls.sql_statement
FROM xx t
CROSS JOIN LATERAL (
        select LEVEL AS n, regexp_substr( t.answer, '\d+',  1, level) as SM_ID
        from dual
        connect by regexp_substr( t.answer, '\d+',  1, level) IS NOT NULL
) t1
left join test_1 mst 
on mst.sr_id=t.id
right join test_3 seg_dtls
on seg_dtls.sm_id=t1.sm_id;

The sample data would look like

sr_id   sm_id SEGMENTATION_NAME  LAST_UPDATED_DATE  
1108197 958   test_not_in          05-feb-2017 23:56:59    
1108217 958   test_not_in          14-feb-2017 00:37:39  
1108218 958   test_not_in          14-feb-2017 01:39:50  
1108220 958   test_not_in          14-feb-2017 03:39:07  

and the expected output is

1108220 958   test_not_in          14-feb-2017 03:39:07  

I am not posting CLOB data because it is huge. Every row contains CLOB data.

table test_3 contains  
q_id     sr_id  answer   
1009330 1108246 976~feb_24^941~Test_regionwithcountry  
1009330 1108247 941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24  
1009330 1108239 972~test_emea  
1009330 1108240 972~test_emea^827~test_with_region_country  
1009330 1108251 981~MSE100579729 testing.

and sample data looks like above of test_3
Answer contains sm_id. I have to pull it from here.
for example:

941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24  
the sm_id is 941,787,976 

.

So I had come with the above query posted above.
Again, coming to left and right joins, all sm_id from test_3 is needed so i used the right join here.

edit1 : The accepted answer gives the SR_ID OF SEGMENTS with max(last_updated_date).
I would need all SR_ID. So, I used MINUS operator to get the ones that are not max(last_updated_date).
I need to append that result set to accepted answer.

This is what I did to get the other SR_IDs.

select sr_id,segmentation_name,request_status from (with test_31 (q_id, sr_id, answer) as (
 (SELECT Q_ID,SR_ID,ANSWER FROM test_3 WHERE Q_ID=(SELECT Q_ID FROM test_4 WHERE FIELD_ID='LM_LRE_Q6'))
),
answer_extraction as (
  select q_id, sr_id,
    regexp_substr(regexp_substr(answer, '[^^]+', 1, level),'\d+') as sm_id
  from test_31
  connect by q_id = prior q_id
  and sr_id = prior sr_id
  and prior dbms_random.value is not null
  and regexp_substr(answer, '[^^]+', 1, level) is not null
)
select sr_id,
  sm_id,
  segmentation_name,
  LAST_UPDATED_DATE,
  sql_statement,request_status
from (
  select t1.sr_id,
    t2.sm_id,
    t2.segmentation_name,
    t1.last_updated_date,
    t2.sql_statement,
    t1.request_status

  from test_4 t4
  join answer_extraction t3 on t3.q_id = t4.q_id
  join test_2 t2 on t2.sm_id = t3.sm_id
  join test1 t1 on t1.sr_id = t3.sr_id
)
)
minus

(select  sr_id,segmentation_name , request_status from (with test_31 (q_id, sr_id, answer) as (
 (SELECT Q_ID,SR_ID,ANSWER FROM test_3 WHERE Q_ID=(SELECT Q_ID FROM test_4 WHERE FIELD_ID='LM_LRE_Q6'))
),
answer_extraction as (
  select q_id, sr_id,
    regexp_substr(regexp_substr(answer, '[^^]+', 1, level), '\d+') as sm_id
  from test_31
  connect by q_id = prior q_id
  and sr_id = prior sr_id
  and prior dbms_random.value is not null
  and regexp_substr(answer, '[^^]+', 1, level) is not null
)
select sr_id,
  segmentation_name,
  sql_statement,
   request_status
from (
  select t1.sr_id,
    t2.sm_id,
    t2.segmentation_name,
    t1.last_updated_date,
    t2.sql_statement,
     t1.request_status,
    max(t1.last_updated_date) over (partition by t2.sm_id) as max_updated_date
  from test_4 t4
  join answer_extraction t3 on t3.q_id = t4.q_id
  join test_2 t2 on t2.sm_id = t3.sm_id
  join test_1 t1 on t1.sr_id = t3.sr_id
)
where last_updated_date = max_updated_date));

}

sample data:
Accepted answer gives below output with max(last_updated_date) of the segment.

1097661 Submitted   o2k lad 30-NOV-15   01-DEC-16   62  CLOB DATA  

above posted query GIVES below output which is sr_id of segments with other updated dates.

 1097621    o2k lad Submitted
    1097625 o2k lad Submitted
    1097627 o2k lad Submitted
    1097632 o2k lad Submitted
    1097633 o2k lad Submitted
    1097658 o2k lad Pending
    1097640 o2k lad Submitted
    1097644 o2k lad Submitted
    1097646 o2k lad Submitted

expected output:

  sr_id status     segment_name updated_date sql_statement other_sr_id
1097661 Submitted   o2k lad     30-NOV-15     CLOB DATA 1097618,1097621,1097625,1097627,1097632,1097633,1097658,1097640,1097644,1097646

combine the two queries so that last column contains all old sr_id.


Solution

  • A fairly simple option is to modify your current query to add an analytic function that finds the maximum date for each ID, as something like:

    ..., max(mst.last_updated_date) over (partition by id) as max_updated_date
    

    A quick demo of the general idea:

    with cte (id, last_updated_date, sql_statement) as (
      select 1, date '2017-01-01', to_clob('stmt 1') from dual
      union all select 1, date '2017-01-02', to_clob('stmt 2') from dual
      union all select 1, date '2017-01-03', to_clob('stmt 3') from dual
      union all select 2, date '2017-01-02', to_clob('stmt 4') from dual
    )
    select id, last_updated_date, sql_statement
    from (
      select id, last_updated_date, sql_statement,
        max(last_updated_date) over (partition by id) as max_updated_date
      from cte
    )
    where last_updated_date = max_updated_date;
    
            ID LAST_UPDAT SQL_STATEMENT                                                                   
    ---------- ---------- --------------------------------------------------------------------------------
             1 2017-01-03 stmt 3                                                                          
             2 2017-01-02 stmt 4                                                                          
    

    You could use a row_number() or rank() or dense_rank() to identify the row with the earliest date and filter on that instead, but the general idea is the same.

    However, your current query isn't very clear (or valid before 12c) to begin with. Rather than trying to guess how to include such a function and filter, it's probably simpler to start again from your base tables, though this makes a lot of assumptions about what you're doing, and may ignore some things - like left and right joins - that may or may not actually be needed.

    Making up some data via CTEs:

    with test_1 (sr_id, last_updated_date) as (
      select 1108197, timestamp '2017-02-05 23:56:59' from dual
      union all select 1108217, timestamp '2017-02-14 00:37:39' from dual
      union all select 1108218, timestamp '2017-02-14 01:39:50' from dual
      union all select 1108220, timestamp '2017-02-14 03:39:07' from dual
    ),
    test_2 (sm_id, segmentation_name, sql_statement) as (
      select 958, 'test_not_in', to_clob('select * from dual') from dual
    ),
    test_3 (q_id, sr_id, answer) as (
      select 41, 1108197, 958 from dual
      union all select 42, 1108217, 958 from dual
      union all select 43, 1108218, 958 from dual
      union all select 44, 1108220, 958 from dual
    ),
    test_4 (q_id, field_id) as (
      select 41, 'LM_LRE_Q6' from dual
      union all select 42, 'LM_LRE_Q6' from dual
      union all select 43, 'LM_LRE_Q6' from dual
      union all select 44, 'LM_LRE_Q6' from dual
    )
    

    then this gets the same output you showed in the question:

    select t1.sr_id,
      t2.sm_id,
      t2.segmentation_name,
      to_char(t1.last_updated_date, 'dd-mon-yyyy hh24:mi:ss') as last_updated_date,
      t2.sql_statement
    from test_4 t4
    join test_3 t3 on t3.q_id = t4.q_id
    join test_2 t2 on t2.sm_id = t3.answer
    join test_1 t1 on t1.sr_id = t3.sr_id;
    
         SR_ID SM_ID SEGMENTATIO LAST_UPDATED_DATE             SQL_STATEMENT                                                                   
    ---------- ----- ----------- ----------------------------- --------------------------------------------------------------------------------
       1108197   958 test_not_in 05-feb-2017 23:56:59          select * from dual                                                              
       1108217   958 test_not_in 14-feb-2017 00:37:39          select * from dual                                                              
       1108218   958 test_not_in 14-feb-2017 01:39:50          select * from dual                                                              
       1108220   958 test_not_in 14-feb-2017 03:39:07          select * from dual                                                              
    

    On the wild assumption that is close to right, you could find the row with the latest date for each sm_id with something like this:

    select sr_id,
      sm_id,
      segmentation_name,
      to_char(last_updated_date, 'dd-mon-yyyy hh24:mi:ss') as last_updated_date,
      sql_statement
    from (
      select t1.sr_id,
        t2.sm_id,
        t2.segmentation_name,
        t1.last_updated_date,
        t2.sql_statement,
        max(t1.last_updated_date) over (partition by t2.sm_id) as max_updated_date
      from test_4 t4
      join test_3 t3 on t3.q_id = t4.q_id
      join test_2 t2 on t2.sm_id = t3.answer
      join test_1 t1 on t1.sr_id = t3.sr_id
    )
    where last_updated_date = max_updated_date;
    
         SR_ID SM_ID SEGMENTATIO LAST_UPDATED_DATE             SQL_STATEMENT                                                                   
    ---------- ----- ----------- ----------------------------- --------------------------------------------------------------------------------
       1108220   958 test_not_in 14-feb-2017 03:39:07          select * from dual                                                              
    

    You will need to adapt that to deal with any other restrictions or requirements that weren't clear (including your left/right outer joins, for instance).

    I've deliberately ignored the the subquery you were doing to split the 'answer' into multiple values. It's possible you have something horrible like a delimited list of IDs in there, which is a data model problem. If that is the case then you'll still need to extract the individual sm_id values; something like:

    with answer_extraction as (
      select q_id, sr_id, regexp_substr(answer, '\d+', 1, level) as sm_id
      from test_3
      connect by q_id = prior q_id
      and sr_id = prior sr_id
      and prior dbms_random.value is not null
      and regexp_substr(answer, '\d+', 1, level) is not null
    )
    select sr_id,
      sm_id,
      segmentation_name,
      to_char(last_updated_date, 'dd-mon-yyyy hh24:mi:ss') as last_updated_date,
      sql_statement
    from (
      select t1.sr_id,
        t2.sm_id,
        t2.segmentation_name,
        t1.last_updated_date,
        t2.sql_statement,
        max(t1.last_updated_date) over (partition by t2.sm_id) as max_updated_date
      from test_4 t4
      join answer_extraction t3 on t3.q_id = t4.q_id
      join test_2 t2 on t2.sm_id = t3.sm_id
      join test_1 t1 on t1.sr_id = t3.sr_id
    )
    where last_updated_date = max_updated_date;
    

    Based on the actual contents of test3 you added, your regular expression isn't doing quite what you need. With the pattern you're using it finds 14 numeric values, i.e. any numbers:

    with test_3 (q_id, sr_id, answer) as (
      select 1009330, 1108246, '976~feb_24^941~Test_regionwithcountry' from dual
      union all select 1009330, 1108247, '941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24' from dual
      union all select 1009330, 1108239, '972~test_emea' from dual
      union all select 1009330, 1108240, '972~test_emea^827~test_with_region_country' from dual
      union all select 1009330, 1108251, '981~MSE100579729 testing.' from dual
    ),
    answer_extraction as (
      select q_id, sr_id, regexp_substr(answer, '\d+', 1, level) as sm_id
      from test_3
      connect by q_id = prior q_id
      and sr_id = prior sr_id
      and prior dbms_random.value is not null
      and regexp_substr(answer, '\d+', 1, level) is not null
    )
    select * from answer_extraction;
    
          Q_ID      SR_ID SM_ID     
    ---------- ---------- ----------
       1009330    1108239 972       
       1009330    1108240 972       
       1009330    1108240 827       
       1009330    1108246 976       
       1009330    1108246 24        
       1009330    1108246 941       
       1009330    1108247 941       
       1009330    1108247 2016      
       1009330    1108247 787       
       1009330    1108247 28        
       1009330    1108247 976       
       1009330    1108247 24        
       1009330    1108251 981       
       1009330    1108251 100579729 
    

    It appears you only want the bits between the ^ delimiters and the ~ markers. A common way to split a delimited string is:

    with test_3 (q_id, sr_id, answer) as (
      select 1009330, 1108246, '976~feb_24^941~Test_regionwithcountry' from dual
      union all select 1009330, 1108247, '941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24' from dual
      union all select 1009330, 1108239, '972~test_emea' from dual
      union all select 1009330, 1108240, '972~test_emea^827~test_with_region_country' from dual
      union all select 1009330, 1108251, '981~MSE100579729 testing.' from dual
    ),
    answer_extraction as (
      select q_id, sr_id, regexp_substr(answer, '[^^]+', 1, level) as sm_id
      from test_3
      connect by q_id = prior q_id
      and sr_id = prior sr_id
      and prior dbms_random.value is not null
      and regexp_substr(answer, '[^^]+', 1, level) is not null
    )
    select * from answer_extraction;
    
          Q_ID      SR_ID SM_ID                                   
    ---------- ---------- ----------------------------------------
       1009330    1108239 972~test_emea                           
       1009330    1108240 972~test_emea                           
       1009330    1108240 827~test_with_region_country            
       1009330    1108246 976~feb_24                              
       1009330    1108246 941~Test_regionwithcountry              
       1009330    1108247 941~Test_regionwithcountry_2016         
       1009330    1108247 787~Test_Request_28                     
       1009330    1108247 976~feb_24                              
       1009330    1108251 981~MSE100579729 testing.               
    

    but you then need to get the first part of that, e.g. borrowing your original pattern (others are available!):

    column sm_id format a10
    with test_3 (q_id, sr_id, answer) as (
      select 1009330, 1108246, '976~feb_24^941~Test_regionwithcountry' from dual
      union all select 1009330, 1108247, '941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24' from dual
      union all select 1009330, 1108239, '972~test_emea' from dual
      union all select 1009330, 1108240, '972~test_emea^827~test_with_region_country' from dual
      union all select 1009330, 1108251, '981~MSE100579729 testing.' from dual
    ),
    answer_extraction as (
      select q_id, sr_id,
        regexp_substr(regexp_substr(answer, '[^^]+', 1, level), '\d+') as sm_id
      from test_3
      connect by q_id = prior q_id
      and sr_id = prior sr_id
      and prior dbms_random.value is not null
      and regexp_substr(answer, '[^^]+', 1, level) is not null
    )
    select * from answer_extraction;
    
          Q_ID      SR_ID SM_ID     
    ---------- ---------- ----------
       1009330    1108239 972       
       1009330    1108240 972       
       1009330    1108240 827       
       1009330    1108246 976       
       1009330    1108246 941       
       1009330    1108247 941       
       1009330    1108247 787       
       1009330    1108247 976       
       1009330    1108251 981       
    

    Notice that extra regexp_substr() is only in the select list, not the connect-by clause; and that the extract sm_id is still a string. If test_2.sm_id is a number then add a to_number() call around the pair of substrings in that select list too.