In an effort to reduce bandwidth, I am trying to strip out unnecessary whitespace. By "unnecessary", I am referring to any vertical whitespace, and horizontal whitespace at the start or end of lines, but not if it is in a <textarea> tag.
While I am no stranger to The Pony He Comes, I'm fairly sure a full HTML parser would be overkill for this task. By my understanding, a regex could work.
The regex I have right now is:
$out = preg_replace("/[ \t]*\r?\n[ \t]*/","",$in);
This seems to strip out the whitespace I specify above, except for the <textarea> rule. My question boils down to: How can I make sure that replacements do not happen within specified boundaries? It can be safely assumed that all HTML entities are properly escaped inside <textarea>s.
If you have the HTML:
<P>a
b</P>
And you strip the vertical whitespace, you will end up with ab instead of a b. So you would need to convert the line break to a space rather than delete it (which is pointless, since it saves nothing).
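To see this concretely, here is a quick check that runs the regex from the question against the two-line paragraph above. The variable names are only for illustration.

```php
<?php
// The two-line paragraph from the example above.
$in = "<P>a\nb</P>";

// The regex from the question: removes each line break together with
// any spaces or tabs around it.
$out = preg_replace("/[ \t]*\r?\n[ \t]*/", "", $in);

echo $out, "\n"; // <P>ab</P>  (the two words have been glued together)
```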
Only stripping near a tag would not help either, since you could have (for example) two SPAN tags next to each other.
Whitespace at the start or end of the line you could strip - but only because you already have vertical whitespace.
So if you really wanted to do this you could collapse multiple occurrences of whitespace to a single space.
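A minimal sketch of that collapse, assuming it is applied blindly to the whole document (so it does not yet skip textareas, pre blocks, or scripts):

```php
<?php
// Sample input with a mix of spaces, a newline and a tab between the words.
$in = "<P>a  \n\t b</P>";

// Replace every run of whitespace with a single space instead of
// deleting it, so adjacent words stay separated.
$out = preg_replace('/\s+/', ' ', $in);

echo $out, "\n"; // <P>a b</P>
```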
If you avoided JavaScript, input fields, pre blocks, and textareas you should be OK. But without a full parser it's impossible to reliably avoid those! For example, someone could put a <TEXTAREA> inside a comment, and without a parser you would keep looking for the end of the textarea and never find it.
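Here is a rough sketch of the obvious regex-only workaround, to show where it goes wrong. collapseOutsideTextarea is a hypothetical helper, not something from the question: it splits the document on anything that looks like a textarea block and collapses whitespace only in the pieces in between.

```php
<?php
// Hypothetical helper: protect <textarea>...</textarea> blocks by regex
// and collapse whitespace everywhere else.
function collapseOutsideTextarea(string $html): string
{
    $parts = preg_split(
        '~(<textarea\b.*?</textarea>)~is',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    foreach ($parts as $i => $part) {
        // With one capture group, even indexes are the text between
        // matches and odd indexes are the captured textarea blocks.
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace('/\s+/', ' ', $part);
        }
    }
    return implode('', $parts);
}

// Works as intended on simple input.
$simple = "<p>x\n\ny</p><textarea>1\n2</textarea>";
$simpleOut = collapseOutsideTextarea($simple);

// A commented-out <textarea> fools it: the "protected" block now starts
// inside the comment and runs to the next </textarea>, so the paragraph
// in between is never touched.
$tricky = "<!-- <textarea> -->\n<p>a\n\nb</p>\n<textarea>keep</textarea>";
$trickyOut = collapseOutsideTextarea($tricky);

echo $simpleOut, "\n";
echo $trickyOut === $tricky ? "unchanged\n" : "changed\n"; // unchanged
```

In this particular case the damage is only missed savings, but it shows that the regex has no idea which <textarea> is real. If the commented-out tag has no closing tag after it at all, the pattern simply never matches.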
But worse is the value attribute of input. You don't want to mess with that - but it's completely impossible to even find it without a parser:
<INPUT name="value='hello'" value='name="hi"'>
The color coding makes it clear what the attributes are, but try finding them without a parser.
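As an illustration, here is what a naive regex lookup does with that tag, next to what PHP's bundled DOM parser reports. The pattern is just one plausible guess at how someone might search for the attribute, and the DOM part assumes the dom extension is available (hence the class_exists guard).

```php
<?php
// The tag from above, as a PHP string.
$tag = '<INPUT name="value=\'hello\'" value=\'name="hi"\'>';

// Naive lookup: take the quoted string after the first "value=".
// It matches the text inside the name attribute, not the real attribute.
preg_match('/value=(["\'])(.*?)\1/i', $tag, $m);
echo $m[2], "\n"; // hello  (wrong: that is part of the name attribute)

// A real parser tokenizes the attributes properly.
$parsed = null;
if (class_exists('DOMDocument')) {
    libxml_use_internal_errors(true); // keep libxml warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($tag);
    $input = $doc->getElementsByTagName('input')->item(0);
    $parsed = $input->getAttribute('value');
    echo $parsed, "\n"; // name="hi"
}
```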
Avoiding the inside of tags doesn't help either, since you can legally put > inside a comment.