
Regular expression to replace line feeds with a space only if the break is not in the contents of an HTML attribute


I'm trying to write a regular expression that replaces line feeds within certain areas of a text file, but only in plain-text content (that is, not inside HTML attribute values such as href). I'm not having much luck past the first part.

Example input:

AUTHOR: Me
DATE: Now
CONTENT:
This is an example. This is another example. <a href="http://www.stackoverflow/example-
link-that-breaks">This is an example.</a> This is an example. This is yet another
example.
END CONTENT
COMMENTS: 0

Example output:

AUTHOR: Me
DATE: Now
CONTENT:
This is an example. This is another example. <a href="http://www.stackoverflow/example-link-that-breaks">This is an example.</a> This is an example. This is yet another example.
END CONTENT
COMMENTS: 0

So ideally, a line break is replaced with a space when it occurs in plain text, but is removed without adding a space when it falls inside an HTML attribute value (mostly href, and I'm fine if I have to limit it to that).


Solution

  • This will remove newlines in attribute values, assuming the values are enclosed in double-quotes:

    // Delete line breaks that fall inside a double-quoted attribute value.
    $s = preg_replace(
           '/[\r\n]+(?=[^<>"]*+"(?:[^<>"]*+"[^"<>]*+")*+[^<>"]*+>)/',
           '', $s);
    

    The lookahead asserts that, between the current position (where the newline was found) and the next >, there's an odd number of double-quotes and no other angle bracket. An odd count means the newline sits inside a quoted value that has not been closed yet. This doesn't allow for single-quoted values, or for angle brackets inside the values; both can be accommodated if need be, but this is ugly enough already. ;)
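If single-quoted values or angle brackets inside values do matter, one way to avoid growing the lookahead further is to switch technique: match each whole tag first, then clean the quoted values inside it with a callback. This is only a sketch of that alternative, not the approach used above; strip_newlines_in_attributes is a hypothetical helper name, and the tag pattern assumes reasonably well-formed markup.

```php
<?php
// Hypothetical helper (not part of the original answer): removes line breaks
// from quoted attribute values, for both double and single quotes.
function strip_newlines_in_attributes(string $s): string
{
    // A tag is '<', then any mix of unquoted characters and complete quoted
    // strings (which may themselves contain '<' or '>'), then '>'.
    $tagPattern = '/<(?:[^<>"\']|"[^"]*"|\'[^\']*\')*>/';

    $result = preg_replace_callback($tagPattern, function (array $tag): string {
        // Within the matched tag, touch only the quoted values.
        return preg_replace_callback(
            '/"[^"]*"|\'[^\']*\'/',
            function (array $value): string {
                return preg_replace('/[\r\n]+/', '', $value[0]);
            },
            $tag[0]
        );
    }, $s);

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $s;
}
```

Line breaks outside of tags are left alone, so the space replacement for plain text can still run afterwards.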

    After that, you can replace any remaining newlines with spaces:

    $s = preg_replace('/[\r\n]+/', ' ', $s);
    

    See it in action on ideone.com.
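
The expected output in the question keeps the header lines (AUTHOR, DATE, COMMENTS) on separate lines, whereas the second replacement above joins every line in the string it is given. A minimal sketch of one way to limit both passes to the content block, assuming the block is always delimited by CONTENT: and END CONTENT marker lines as in the sample; join_content_lines is a hypothetical helper name, not something from the original answer.

```php
<?php
// Hypothetical helper: applies the two replacements only to the text between
// the "CONTENT:" and "END CONTENT" marker lines (names taken from the sample).
function join_content_lines(string $text): string
{
    // Group 1: the marker line and its line break. Group 2: the body.
    // Group 3: the line break just before the "END CONTENT" line.
    $block = '/^(CONTENT:\R)(.*?)(\R)(?=END CONTENT(?:\R|\z))/ms';

    $result = preg_replace_callback($block, function (array $m): string {
        // Pass 1: delete line breaks inside double-quoted attribute values.
        $body = preg_replace(
            '/[\r\n]+(?=[^<>"]*+"(?:[^<>"]*+"[^"<>]*+")*+[^<>"]*+>)/',
            '', $m[2]);
        // Pass 2: the remaining line breaks are in plain text; use a space.
        $body = preg_replace('/[\r\n]+/', ' ', $body);
        return $m[1] . $body . $m[3];
    }, $text);

    // Fall back to the unchanged input if PCRE reports an error.
    return $result ?? $text;
}
```

The order of the two passes matters: running the space replacement first would leave a space inside the href value.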