html regex pcre lookbehind regex-lookarounds

PCRE: (+) and (-) look ahead/behind (Regex)


I have the following string:

<A href="CarPage.asp?parent=CAR123+++&Color=RED">The Car is Red - Its Fast</a>

And I want to extract:

  • CAR123
  • RED
  • The Car is Red - Its Fast

What I have so far is:

(?<=<A href="CarPage\.asp\?parent=)[A-Za-z0-9]*(\+\+\+&Color=)[A-Za-z0-9]{3}(\">)[A-Za-z0-9\- ]*(?=</a>)

But I'm not sure how to set up positive and negative lookaheads and lookbehinds when they are not at the string boundaries.

I know, it's HTML...I've heard it before... "Don't parse html with regex..." I don't need anything more elaborate than this.

Help is appreciated.

Thanks!


Solution

  • It's better to use a parser, but if your link is always formatted in exactly the same way (no ids, classes, extra params, params in a different order, etc.), try:

    parent=(\w+?)\+*&Color=(\w+?)">(.*?)<
    

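    The difference from Mu's suggestion is the greediness: the quantifiers here are lazy (+? and *?), so each group stops at the first point where the rest of the pattern can match. No lookarounds are needed; the three values come back as capture groups 1, 2 and 3.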
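
    As a rough sketch of how the pattern above could be used, here it is in Python. This assumes Python's re module rather than PCRE itself (the syntax is the same for this particular pattern), and the names LINK_RE and extract_car_link are made up for illustration:

```python
import re

# Pattern from the answer above. The lazy quantifiers (+?, *?) make each
# group stop at the first place where the rest of the pattern can match.
LINK_RE = re.compile(r'parent=(\w+?)\+*&Color=(\w+?)">(.*?)<')


def extract_car_link(html):
    """Return (parent, color, link_text), or None if the link is not found.

    Hypothetical helper: it only works when the link is formatted exactly
    like the sample in the question.
    """
    match = LINK_RE.search(html)
    return match.groups() if match else None


sample = '<A href="CarPage.asp?parent=CAR123+++&Color=RED">The Car is Red - Its Fast</a>'
print(extract_car_link(sample))
# ('CAR123', 'RED', 'The Car is Red - Its Fast')
```

    If the markup varies at all (extra attributes, different parameter order), this will return None or capture the wrong text, which is where a real HTML parser becomes the safer choice.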