html, whitespace, bandwidth

Strip unnecessary whitespace - "unnecessary" being key


In an effort to reduce bandwidth, I am trying to strip out unnecessary whitespace. By "unnecessary", I mean any vertical whitespace, plus horizontal whitespace at the start or end of lines, except where it appears inside a <textarea> element.

While I am no stranger to The Pony He Comes, I'm fairly sure a full HTML parser would be overkill for this task. By my understanding, a regex could work.

The regex I have right now is:

$out = preg_replace("/[ \t]*\r?\n[ \t]*/","",$in);

This seems to strip out the whitespace I specify above, except for the <textarea> rule. My question boils down to: How can I make sure that replacements do not happen within specified boundaries? It can be safely assumed that all HTML entities are properly escaped inside <textarea>s.
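
To make the problem concrete, here is a small reproduction. The sample markup is invented for illustration; only the regex comes from the line above.

```php
<?php
// Hypothetical sample input, invented for illustration.
$in = "<div>\n    <p>Hello</p>\n    <textarea>line one\n  line two</textarea>\n</div>\n";

// The regex from the question: removes every line break together with
// the horizontal whitespace on either side of it.
$out = preg_replace("/[ \t]*\r?\n[ \t]*/", "", $in);

echo $out, "\n";
// <div><p>Hello</p><textarea>line oneline two</textarea></div>
```

The markup around the <textarea> is compacted as intended, but the two lines inside it are merged as well, which is the part that must not happen.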


Solution

  • If you have the HTML:

    <P>a
    b</P>
    

    and you strip the vertical whitespace, you end up with ab instead of a b. So you would have to convert the line break to a space instead, which saves almost nothing.

    Stripping only next to a tag would not help either, since whitespace between two adjacent inline elements (two SPAN tags, for example) is still significant.

    You could strip whitespace at the start or end of a line, but only because the line break itself still separates the content.

    So if you really wanted to do this, you could collapse each run of whitespace into a single space.

    If you avoided JavaScript, input fields, PRE elements, and textareas, you should be OK. But without a full parser you cannot reliably avoid those! For example, someone could put a <TEXTAREA> inside a comment, and without a parser you would keep looking for the end of the textarea and never find it.

    Worse still is the value attribute of input. You don't want to mess with that, but it is very hard to even find it reliably without a parser:

    <INPUT name="value='hello'" value='name="hi"'>
    

    Syntax highlighting makes it clear what the attributes are, but try finding them without a parser.

    Avoiding the inside of tags doesn't help either since you can legally put > inside a comment.
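
If you still want to try the "collapse to a single space" approach with a regex, a minimal sketch could look like the following. The function name collapse_html_whitespace() is hypothetical, and the sketch assumes that comments and the protected elements are well-formed, not nested, and have no > inside their attribute values. It does not protect input value attributes, so it has exactly the weaknesses described above.

```php
<?php
// Sketch only: collapse runs of whitespace outside comments and a few
// "protected" elements. collapse_html_whitespace() is a hypothetical helper.
function collapse_html_whitespace(string $html): string
{
    // Alternatives are tried in order at each position:
    //   1. an HTML comment, copied verbatim
    //   2. a textarea/pre/script/style element, copied verbatim
    //   3. a run of whitespace, replaced by one space
    $pattern = '~<!--.*?-->|<(textarea|pre|script|style)\b[^>]*>.*?</\1\s*>|\s+~is';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // Protected chunks always start with "<"; anything else is whitespace.
        return $m[0][0] === '<' ? $m[0] : ' ';
    }, $html);

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

// Invented sample input for illustration.
$in = "<p>a\n   b</p>\n<textarea>  keep\n  this </textarea>\n<!-- <textarea> -->\n<span>x</span>";
$out = collapse_html_whitespace($in);
echo $out, "\n";
```

On this input the line break between a and b becomes a single space, the textarea content keeps its spacing and line break, and the <textarea> inside the comment does not confuse the match because the comment alternative is tried first. Anything less tidy than this sample (an unclosed comment, a > inside an attribute value) can still break it, which is the argument for a real parser.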