Search code examples
regexp-substr

How to extract specific string until blank space/next line from a text in Oracle?


I am trying to extract the following from the text field using Regrex in Oracle.

  • For example

"This is example,

and this really a example :h,j,j,j,j,

l //Updated question , as this letter is on the next line

now this is a disease:yes"

I am expecting a result as h,j,j,j,j,l, but if I use

REGEXP_SUBSTR(text_field,'example :[^:]+,') AS Result

I am getting example:h,j,j,j,j But I am not getting the last letter 'l' like above and I am guessing that's because it's on the next line.Also, if I want the string "disease:yes" only, that will be so helpful as well. Thank you much!


Solution

  • The result you are getting is because your pattern includes the word 'example' and ends with a comma, leaving out the ending 'l'. Try this form instead. Note the example is shown using a Common table Expression (CTE). The WITH statement creates the table called tbl which just sets up test data, kind of like a temp table. This is also a great way to set up data when asking a question. This form of the REGEXP_SUBSTR() function uses a captured group, which is the set of characters after the string 'example:' until the end of that line in the multi-line field. From this you should be able to get the other string you are after. Give it a go.

    WITH tbl(text_field) AS (
      SELECT 'This is example,
    
    and this really a example :h,j,j,j,j,l
    
    now this is a disease:yes' FROM dual
    )
    SELECT REGEXP_SUBSTR(text_field,'example :(.*)', 1, 1, NULL, 1) AS Result
    FROM tbl;
    
    RESULT     
    -----------
    h,j,j,j,j,l
    
    1 row selected.
    

    Edit based on new info. Since that last letter could be on it's own line, you'll need to allow for the newline. Use the 'n' flag to REGEXP_REPLACE() which allows the newline to match in the usage of the dot (match any character) symbol in regex. We switch to REGEXP_REPLACE as we'll need to return multiple capture groups. Here the WITH sets up 2 rows, one with an embedded newline in the data and one without. The capture groups are (going left to right) 1-the data after "example :" and ending in a comma, 2-the optional newline and 3-the next single character. Then replace the entire data with captured groups 1 and 3 (leaving out the newline). NOTE this is very specific to the case of only 1 character on the following line.

    WITH tbl(ID, text_field) AS ( 
      SELECT 1, 'This is example,
    and this really a example :h,j,j,j,j,
    l
    now this is a disease:yes'  FROM dual UNION ALL
      SELECT 2, 'This is example,
    and this really a example :h,j,j,j,j,l
    now this is a disease:yes'  FROM dual
    ) 
    SELECT ID,
           REGEXP_REPLACE(text_field, '.*example :(.*,)('||CHR(10)||')?(.).*', '\1\3', 1, 1, 'n') AS Result 
    FROM tbl; 
    
    
            ID RESULT      
    ---------- ------------
             1 h,j,j,j,j,l 
             2 h,j,j,j,j,l 
    
    2 rows selected.