regex pcre

Capturing all occurrences of a specific word when it is not part of a link


I'm trying to use a regex (PCRE2 dialect) to match all occurrences of the word 'apple' in an HTML text, excluding those where the word is part of a link.
I'm quite a beginner with regex, so I'm probably making a simple mistake. This is what I have so far:

\bapple\b

In the following text, the regex should match the first occurrence but not the second or third:

Lorem ipsum apple sit amet, consectetur <a href="#">apple</a> elit <a href="/test/apple">lorem</a>. 

What am I doing wrong?


Solution

  • In PCRE, you may use this regex:

    ~(?is)<a .*?</a>(*SKIP)(*F)|\bapple\b~
    

    RegEx Demo

    RegEx Details:

    • (?is): Enable ignore case and DOTALL modes
    • <a .*?</a>: Match text from <a to </a>, so that every <a> tag is consumed and can be skipped
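    • (*SKIP)(*F): together these verbs force the match to fail and stop the engine from retrying inside the text just consumed, so everything matched by the <a ...>...</a> branch is discarded. This is a handy workaround for the restriction that PCRE does not allow variable-length lookbehind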
    • |: OR
    • \bapple\b: Match word apple
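
If you need the same behaviour in an engine without backtracking verbs, here is a rough sketch of a different technique, not the PCRE solution above. Python's built-in re module, for example, does not support (*SKIP)(*F). The sketch keeps the same alternation but puts the wanted word in a capture group, then ignores every match where that group is empty. The function name find_outside_links is made up for illustration, and like the regex above it assumes links are written as <a ...>...</a> without nesting.

```python
import re

# First branch consumes a whole <a ...>...</a> element (group 1 stays unset).
# Second branch captures the standalone word in group 1.
PATTERN = re.compile(r"<a .*?</a>|\b(apple)\b", re.IGNORECASE | re.DOTALL)


def find_outside_links(html):
    """Return (start, end) spans of 'apple' found outside <a> elements.

    Hypothetical helper for illustration; not part of any library.
    """
    return [
        m.span(1)
        for m in PATTERN.finditer(html)
        if m.group(1) is not None  # skip matches produced by the <a> branch
    ]


text = (
    'Lorem ipsum apple sit amet, consectetur <a href="#">apple</a> '
    'elit <a href="/test/apple">lorem</a>.'
)
spans = find_outside_links(text)
print(spans)  # [(12, 17)]
```

Only the first apple is reported. The two inside the <a> elements are swallowed by the first branch, which plays the role that (*SKIP)(*F) plays in PCRE.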