Search code examples
regexoracle-databasestring-matching

Regex in oracle


How to get these data I want in Oracle, using REGEXP_SUBSTR

SPRINTMVNO_PM_CDR_IWIRELESS_20121110_0813.csv get '08'in last four digits
RK_IPDR_RKMSG2_0043722_DT_20121113162710.txt  get '0043722' in the middle(between'_')
wireless_201211120015_201211120515            get '0515' (last four digits)

I have tried many times, but some expression works fine in PHP or other language but not work in ORACLE. Maybe the syntax is different.

For example : the second one I can use /(?<=_)[0-9]*(?=_)/ to get the number in php, but this does not work in Oracle.
I tried

SELECT REGEXP_SUBSTR('RK_IPDR_RKMSG2_0043722_DT_20121113162710.txt','(?<=_)[0-9]*(?=_)') 
  FROM dual;

No output. So it's not the problem of the two slash lines

An alternate formulation of this question would be "how to get content between or start with a character but not include it, with Oracle's regex ? "

I know I can get those data easily by using string functions, the problem is there are tons of different strings to handle, each of them have different data to retrieve. So I want to store the patterns into database, and use one regexp_substr to get all data. Otherwise I need to hard code those rules.


Solution

  • Oracle practitioners survived for years without regular expressions because Oracle provides some simple string functions which we can combine for some nifty manipulation.

    For instance, to find the first two characters after the last underscore in a string use SUBSTR() and INSTR() like this:

    with t as (select 'SPRINTMVNO_PM_CDR_IWIRELESS_20121110_0813.csv' str from dual)
    select substr(str, instr(str, '_', -1)+1, 2)
    from t
    /
    

    Note the INSTR() call has a negative offset to start counting from the back. Getting the last four characters of a string employs the same trick:

    with t as (select 'iwireless_201211120015_201211120515' str from dual)
    select substr(str, -4)
    from t
    /
    

    The easiest way to identify a pattern of underscore followed by digits followed by underscore is with a regex but we can use a TRIM() to remove the underscores from the result.

    with t as (select 'RK_IPDR_RKMSG2_0043722_DT_20121113162710.txt' str from dual)
    select trim('_' from regexp_substr(str, '_([0-9]+)_'))
    from t
    /
    

    Here's a SQL Fiddle to prove that these techniques work.

    Oracle has a vast array of functions, which are described in the documentation. Find out more.


    " please ignore the cases, I just need a solution of this 'how to get content between or start with a character but not include it, with Oracle's regex ?'"

    There is a way to exclude characters from the start or end of the result, and that is to break up the search pattern into sub-expressions. This will work for the string you provide, because we can separate the leading and trailing underscores from the required numbers. Unfortunately, the subexpressions parameter is the last parameter in the REGEXP_SUBSTR() signature, and as SQL functions don't accept named parameters this means we have to explicitly pass default values for all the other parameters.

    Anyway, this call will return the second subexpression, which is the desired string, 0043722:

    with t as (select 'RK_IPDR_RKMSG2_0043722_DT_20121113162710.txt' str from dual)
    select regexp_substr(str, '(_)([0-9]+)(_)', 1,1,'i',2)
    from t
    /
    

    The use cases do matter. The REGEXP functions perform slower than the simpler equivalents. In 10gR2 REGEXP_SUBSTR() is at least an order of magnitude slower than SUBSTR(). The difference is noticeable when searching large numbers of strings, and crippling when that number becomes millions (disclosure: recent pain).