
Add exceptions to complex regular expression (lookahead and lookbehind utilized)


I'd like some help with regular expressions because I'm not really familiar with them. So far, I have created the following regex:

/\b(?<![\#\-\/\>])literal(?![\<\'\"])\b/i

As https://regex101.com/ states:

\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)

Negative Lookbehind (?<![\#\-\/\>])

Assert that the Regex below does not match

Match a single character present in the list below [\#\-\/\>]

\# matches the character # literally (case insensitive)

\- matches the character - literally (case insensitive)

\/ matches the character / literally (case insensitive)

\> matches the character > literally (case insensitive)

literal matches the characters literal literally (case insensitive)

Negative Lookahead (?![\<\'\"])

Assert that the Regex below does not match

Match a single character present in the list below [\<\'\"]

\< matches the character < literally (case insensitive)

\' matches the character ' literally (case insensitive)

\" matches the character " literally (case insensitive)

\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)

Global pattern flags

i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])

I want to add two exceptions to this matching rule. 1) If the ">" is preceded by "p", that is, it closes a <p> starting tag, the literal should still be matched. 2) If the "<" is followed by "/p", that is, it opens a </p> closing tag, the literal should also still be matched. How can I achieve this?

Example: only the bold ones should match.

<p>
    **Literal** in computer science is a
    <a href='http://www.google.com/something/literal#literal'>literal</a>
    for representing a fixed value in source code. Almost all programming 
    <a href='http://www.google.com/something/else-literal#literal'>languages</a>
    have notations for atomic values such as integers, floating-point 
    numbers, and strings, and usually for booleans and characters; some
    also have notations for elements of enumerated types and compound
    values such as arrays, records, and objects. An anonymous function
    is a **literal** for the function type which is **LITERAL**
</p>

I know I have over-complicated things, but the situation itself is complicated and I don't see any other way.


Solution

  • If the text you're searching is just text mixed with some <a> tags, then you can simplify the < and > parts of the lookarounds, and give a specific string that it shouldn't be followed by: </a>.

    /\b(?<![-#\/])literal(?!<\/a>)\b/i
    

    Regex101 Demo
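
As a quick check, the pattern above can be run against the sample from the question. This is a minimal sketch in Python, assuming the `re` module's handling of lookarounds matches the PCRE behaviour shown on regex101 for this pattern. The names `PATTERN`, `SAMPLE` and `find_literals` are illustrative only, and the sample text is abridged with the bold markers removed.

```python
import re

# Pattern from the answer: reject "literal" when it is preceded by '-', '#'
# or '/', or when it is immediately followed by "</a>".
PATTERN = re.compile(r"\b(?<![-#\/])literal(?!<\/a>)\b", re.IGNORECASE)

# Abridged sample text from the question (hypothetical test fixture).
SAMPLE = """<p>
    Literal in computer science is a
    <a href='http://www.google.com/something/literal#literal'>literal</a>
    for representing a fixed value in source code. Almost all programming
    <a href='http://www.google.com/something/else-literal#literal'>languages</a>
    have notations for atomic values. An anonymous function
    is a literal for the function type which is LITERAL
</p>"""


def find_literals(text: str) -> list[str]:
    """Return every occurrence of the word that the pattern accepts."""
    return PATTERN.findall(text)


if __name__ == "__main__":
    print(find_literals(SAMPLE))            # ['Literal', 'literal', 'LITERAL']
    print(find_literals("<p>Literal</p>"))  # ['Literal']
```

Because ">" is no longer in the lookbehind and only "</a>" is rejected by the lookahead, the word is still matched when it sits directly against a <p> or </p> tag. The trade-off is that it would also match inside any other element, so this only holds under the answer's assumption that <a> is the only tag mixed into the text.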