Search code examples
sqlstringoracle-databaseregex-group

Regular Expression Help: Extracting Fields


In Oracle, I have a database CLOB field I need to extract data from.

The specification looks like:

[<br /><b>A:</b><br />Text A which does not contain HTML[<br />]]
[<br /><b>B:</b><br />Text B which does not contain HTML[<br />]]
[<br /><b>C:</b><br />Text C which does not contain HTML[<br />]]

Any of the three may be missing or in any order. There is not always a newlines separating the fields, and the fields can contain newlines.

I need:

  1. To be able to recognize if the field is in the right format: I think I'm ok as long as I check that the field starts with <br /><b>, but a better regex would be awesome.
  2. To be able to extract A, B, and/or C sans the "header".

Examples:

field Valid A B C
<br /><b>A:</b><br />Foo<br /> <br /><b>B:</b><br />Bar<br /> <br /><b>C:</b><br />Baz<br /> Yes Foo Bar Baz
<br /><b>A:</b><br />Foo Yes Foo
<br /><b>B:</b><br />Bar<br /> Yes Bar
<br /><b>A:</b><br />Foo <br /><b>B:</b><br />Bar<br /> Yes Foo Bar
<br /><b>A:</b><br />Foo<br /> <br /><b>C:</b><br />Baz<br /> Yes Foo Baz

Are there any regex gurus who might be able to tell me if/how I could extract A, B, and/or C?

Thanks!

Edit: I've added a SQLFiddle at http://sqlfiddle.com/#!4/9aae2/14/0


Solution

  • I asked some friends on Facebook as well.

    The expression that did what I wanted most -- just to return the values of A, B, and C -- was:

    (.*<b>[ABC]:<.b>)(<br .>)?([^<]*)

    using the parameters

    • 1 (start with the beginning of the string)
    • 1 (capture the first occurrence of the pattern in the string)
    • NULL (no special flags)
    • 3 (return the 3rd sub-expression, i.e. the ([^<]*)

    See SQL Fiddle here.

    Thank you to pandachelion and spinjector for helping, and my friend Todd Zimnoch for the regexp (and parameters) that did what I wanted most.