I am mining text fields for terms such as home births, delivered at home, etc etc. I have code below that mostly works. However, how can I limit the number of characters to limit the number of false positives I might be getting?
I would want lines 1 & 2 but not 3. I am looking to limit the number of characters within the regexp_like statement to 10 or 20 so I would get home delivery or delivery at home.
with test (id, col) as
(select 1, 'abc 3/4/16 blah blah home delivery' from dual union all
select 2, 'abc 3/4/16 blah blah 3/7/16 delivery at home xxx cc2' from dual union all
select 3, 'xxx 3/5/18 delivery 234 imp happened on 5/8/19 sent home 23f' from dual union all
select 4, '3/10/18 bla bla imp-3/9/17 xfe 334 3/4/13 x' from dual
)
select * from test
where regexp_like(col,'(home|deliver).*(deliv|birth|home)') ;
Thank you-
If you are otherwise satisfied with the matches and performance you are getting,
you can limit the length of intervening characters between the home|deliver
group and deliv|birth|home
group by replacing the unbounded .*
with a length-limited any-character matcher.
You can specify the allowed max-length and min-length of intervening characters in the string.
An example is below, that allows 0 - 30 characters between the groups.
WITH TEST (ID , COL) AS
(SELECT 1, 'abc 3/4/16 blah blah home delivery' FROM DUAL UNION ALL
SELECT 2, 'abc 3/4/16 blah blah 3/7/16 delivery at home xxx cc2' FROM DUAL UNION ALL
SELECT 3, 'xxx 3/5/18 delivery 234 imp happened on 5/8/19 sent home 23f' FROM DUAL UNION ALL
SELECT 4, '3/10/18 bla bla imp-3/9/17 xfe 334 3/4/13 x' FROM DUAL)
SELECT *
FROM TEST
WHERE REGEXP_LIKE(COL , '(home|deliver).{0,30}(deliv|birth|home)');
Result:
ID COL
_____ _______________________________________________________
1 abc 3/4/16 blah blah home delivery
2 abc 3/4/16 blah blah 3/7/16 delivery at home xxx cc2
2 rows selected.