Search code examples
regexoracle-databaseregular-language

limiting number of characters in regexp_like


I am mining text fields for terms such as home births, delivered at home, etc etc. I have code below that mostly works. However, how can I limit the number of characters to limit the number of false positives I might be getting?

I would want lines 1 & 2 but not 3. I am looking to limit the number of characters within the regexp_like statement to 10 or 20 so I would get home delivery or delivery at home.

with test (id, col) as
  (select 1, 'abc 3/4/16 blah blah home  delivery'                          from dual union all
   select 2, 'abc 3/4/16 blah blah 3/7/16 delivery at home xxx cc2'         from dual union all
   select 3, 'xxx 3/5/18 delivery 234 imp happened on 5/8/19 sent home 23f' from dual union all
   select 4, '3/10/18 bla bla imp-3/9/17 xfe 334 3/4/13 x'                  from dual
  )
select * from test 
where     regexp_like(col,'(home|deliver).*(deliv|birth|home)') ;

Thank you-


Solution

  • If you are otherwise satisfied with the matches and performance you are getting,
    you can limit the length of intervening characters between the home|deliver group and deliv|birth|home group by replacing the unbounded .* with a length-limited any-character matcher.

    You can specify the allowed max-length and min-length of intervening characters in the string.
    An example is below, that allows 0 - 30 characters between the groups.

    WITH TEST (ID , COL) AS
        (SELECT 1, 'abc 3/4/16 blah blah home  delivery' FROM DUAL UNION ALL
         SELECT 2, 'abc 3/4/16 blah blah 3/7/16 delivery at home xxx cc2' FROM DUAL UNION ALL
         SELECT 3, 'xxx 3/5/18 delivery 234 imp happened on 5/8/19 sent home 23f' FROM DUAL UNION ALL
         SELECT 4, '3/10/18 bla bla imp-3/9/17 xfe 334 3/4/13 x' FROM DUAL)
    SELECT *
    FROM TEST
    WHERE REGEXP_LIKE(COL , '(home|deliver).{0,30}(deliv|birth|home)');
    

    Result:

       ID                                                     COL
    _____ _______________________________________________________
        1 abc 3/4/16 blah blah home  delivery
        2 abc 3/4/16 blah blah 3/7/16 delivery at home xxx cc2
    
    
    2 rows selected.