php regex pcre

PCRE REGEX to match one or more sentences containing a set of characters


I have blocks of text which contain only one specific HTML tag (i.e. the "mark" tag) and I want to extract a paragraph of all the contiguous "sentences" which contain that tag. A "sentence" in my use-case is delimited by a question mark, exclamation mark, full stop or semi-colon.

EDIT: The "mark" tags are being generated automatically on the server-side and they are always well-formed. There is no risk of summoning cthulhu in my use-case.

What I have tried:

I started from the second result in this PCRE regex, which selects all sentences that contain the word "flung" (see, for example, this regex tester). I have added semi-colons, since those are sentence delimiters in my use-case as well:

/[^.;?!]*(?<=[.;?\s!])flung(?=[\s.;?!])[^.;?!]*[.;?!]/igm

This works well, except for two issues that I still need help with:

  • How can I keep decimal numbers, e.g. 12.34, from splitting a sentence during the match? For example, "Lorem ipsum 12.34 dolor flung sit amet" should be one sentence. Currently, the period in the decimal number is treated as sentence punctuation, which it isn't. I suppose modifying the REGEX to detect whether the period has a digit or a letter around it would work, and I've attempted a look-ahead constraint such as ?:[^\.]|\.(?=\d), but it doesn't match, or I'm not applying it correctly.

  • I would like to modify this to match all "mark" HTML tags instead of a word such as "flung". I know that REGEX isn't good for html tags, but an HTML parser can't recognize these characters either (? ! . ;). Maybe I could consider a combination of the two?

What I expect:

Example 1: (basic match)

harum quidem rerum facilis est et expedita distinctio? Nam libero tempore, cum soluta nobis est eligendi optio <mark>cumque</mark> nihil impedit .23 quo minus id 0.89 quod maxime placeat facere possimus, omnis voluptas assumenda est, 12.34 omnis dolor repellendus! Itaque earum rerum hic tenetur a sapiente delectus, quod maxime placeat

should return

Nam libero tempore, cum soluta nobis est eligendi optio <mark>cumque</mark> nihil impedit .23 quo minus id 0.89 quod maxime placeat facere possimus, omnis voluptas assumenda est, 12.34 omnis dolor repellendus!

since that is the sentence which contains the "mark" tag, and the decimal points are not full-stops.

Example 2: (any sentences that don't contain the tag but lie between other tagged sentences are included as well)

At vero eos et accusamus et iusto odio dignissimos ducimus. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do <mark>eiusmod</mark> tempor incididunt ut labore et dolore <mark>magna</mark> aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea <mark>commodo</mark> consequat? Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur! Excepteur sint <mark>occaecat</mark> cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum; sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam?

should return the below (note how the sentence "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur!" was included even though it has no tag, because it is between two other matched sentences).

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do <mark>eiusmod</mark> tempor incididunt ut labore et dolore <mark>magna</mark> aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea <mark>commodo</mark> consequat? Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur! Excepteur sint <mark>occaecat</mark> cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum


Solution

  • You may use this PCRE regex that meets your requirements:

    ((?<!\S)[^.?!;]*?<mark>.+?(?>[.?;!](?!\S)|\z))(?>(?>\h+.+?[.;?!])*?\h+(?1))*
    


    RegEx Details:

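    • (?<!\S): Assert that the current position is not preceded by a non-whitespace character, i.e. we are at the start of the input or right after whitespace
    • [^.?!;]*?: Lazily match 0 or more characters that are not sentence terminators (anything not listed inside [...])
    • <mark>.+?: Match the opening mark tag, then lazily match 1 or more characters up to the sentence terminator
    • (?>[.?;!](?!\S)|\z): Match a sentence terminator that is not followed by a non-whitespace character (so the period in 12.34 is skipped), or match the end of input
    • (?>\h+.+?[.;?!])*?: Lazily match 0 or more sentences in between marked sentences
    • \h+(?1): Match the horizontal whitespace between sentences, then recurse into the 1st subpattern to match another marked sentence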
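
    Since the question is tagged php, here is a minimal sketch of how the pattern might be called with preg_match(). The names MARKED_PARAGRAPH_RE and extractMarkedParagraph() are made up for this example; only the pattern comes from the answer, wrapped in ~ delimiters. It assumes PHP 7.1 or later (for the nullable return type) and single-line input, because . does not match newlines unless the s flag is added.

```php
<?php
// Hypothetical constant name; the pattern is copied verbatim from the answer.
const MARKED_PARAGRAPH_RE =
    '~((?<!\S)[^.?!;]*?<mark>.+?(?>[.?;!](?!\S)|\z))(?>(?>\h+.+?[.;?!])*?\h+(?1))*~';

/**
 * Hypothetical helper: returns the first run of contiguous sentences that
 * begins and ends with a sentence containing <mark>, or null if there is none.
 */
function extractMarkedParagraph(string $text): ?string
{
    // preg_match() returns 1 on a match, 0 on no match and false on error.
    if (preg_match(MARKED_PARAGRAPH_RE, $text, $m) === 1) {
        return $m[0]; // $m[0] holds the whole match
    }
    return null;
}

// Example 1 from the question.
$text = 'harum quidem rerum facilis est et expedita distinctio? Nam libero tempore, cum soluta nobis est eligendi optio <mark>cumque</mark> nihil impedit .23 quo minus id 0.89 quod maxime placeat facere possimus, omnis voluptas assumenda est, 12.34 omnis dolor repellendus! Itaque earum rerum hic tenetur a sapiente delectus, quod maxime placeat';

// Prints the single sentence running from "Nam libero" to "repellendus!".
echo extractMarkedParagraph($text), PHP_EOL;
```

    Note that each matched sentence includes its terminator, so on Example 2 the result should end in "laborum;" rather than "laborum". If the trailing semicolon is unwanted, something like rtrim($result, ';') would remove it.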