Is it possible to develop fast, safe, streaming regex HTML minification?


I know, one should never parse HTML with regex. And parsing is the only way to get really effective HTML minification.

But what if I'm not that worried about perfection? I just want to get a reasonable amount of whitespace out of my HTML.

And instead of applying a regex to a massive file, I want to apply it to a stream of chunks of the file.

My current solution is simply this:

(?<=>)\s+(?=<)

That matches any run of whitespace (one or more characters) between the end of one HTML tag and the start of the next, e.g. > <. I replace each match with " " (a single space).
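
As a rough sketch of how this pattern could be applied to a stream of chunks rather than the whole file, the Python below holds back a trailing `>` and any whitespace after it until the next chunk arrives, so a run that straddles a chunk boundary is still matched as a whole. The helper name `minify_chunks` and the `TRAILING_TAG_END` pattern are hypothetical, and the chunks are assumed to be already-decoded text rather than raw bytes.

```python
import re
from typing import Iterable, Iterator

# The pattern from the question: whitespace between '>' and '<'.
BETWEEN_TAGS = re.compile(r"(?<=>)\s+(?=<)")
# Hypothetical helper pattern: a '>' followed only by whitespace up to the
# end of the buffered text, i.e. a match that might continue in the next chunk.
TRAILING_TAG_END = re.compile(r">\s*\Z")


def minify_chunks(chunks: Iterable[str]) -> Iterator[str]:
    """Collapse whitespace between tags across a stream of text chunks."""
    carry = ""
    for chunk in chunks:
        text = carry + chunk
        tail = TRAILING_TAG_END.search(text)
        # Hold back the trailing '>' plus whitespace; emit everything before it.
        cut = tail.start() if tail else len(text)
        head, carry = text[:cut], text[cut:]
        if head:
            yield BETWEEN_TAGS.sub(" ", head)
    if carry:
        # Nothing follows the held-back tail, so it cannot match; emit it as-is.
        yield carry


if __name__ == "__main__":
    pieces = ["<div>  ", " \n <p>Hi</p>", "\n</div>"]
    print("".join(minify_chunks(pieces)))  # prints <div> <p>Hi</p> </div>
```

The emitted pieces do not line up with the input chunks, since some text is delayed by one chunk, but concatenating them should give the same result as running the substitution over the whole document at once.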

My questions are

  1. Is this safe? i.e. is there anything in typical HTML that this might break?
  2. Can I get better performance (speed and/or more matches) without sacrificing safety?

(P.S.: I've applied this to a file that was ~500 KB. It went down to 350 KB. Using an external minifier took it to 340 KB. I'm pretty happy with the 150 KB savings and not too worried about the extra 10 KB.)


Solution

  • It depends. Consider this HTML snippet:

    <div> 
        <p>Some paragraph here</p>
        <div data-rel="some data > < here"> 
            <p>some subparagraph here</p>
        </div>
    </div>
    

    Here your expression also matches the > < inside a potentially important data attribute (see a demo here), which might or might not break your code (ad 1.).

    Concerning your second question (ad 2.), matching the characters directly is usually faster than using lookarounds, so you could just as well write:

    >\s+<
    

    And replace this with (or with > < if you want to keep the single space from your original replacement)

    ><
    

    See the reduction in steps compared to your first expression here (28 steps instead of 259, a reduction of ~90 percent).
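
To make both points concrete, here is a minimal Python sketch that runs the two patterns over the snippet above. It assumes Python's `re` module, so the step counts from the demo do not carry over directly, and the variable names are only illustrative. It checks that the consuming pattern changes the `data-rel` value when the replacement is `><`, that a single-space `> <` survives when the replacement is `> <`, and that the latter gives the same output as the lookaround version replaced with a single space.

```python
import re

LOOKAROUND = re.compile(r"(?<=>)\s+(?=<)")  # pattern from the question
CONSUMING = re.compile(r">\s+<")            # pattern from the answer

snippet = '''<div>
    <p>Some paragraph here</p>
    <div data-rel="some data > < here">
        <p>some subparagraph here</p>
    </div>
</div>'''

# Dropping the whitespace entirely also rewrites the attribute value.
removed = CONSUMING.sub("><", snippet)
# Keeping one space leaves a single-space "> <" in the attribute unchanged.
kept = CONSUMING.sub("> <", snippet)

print('data-rel="some data >< here"' in removed)   # True
print('data-rel="some data > < here"' in kept)     # True
print(kept == LOOKAROUND.sub(" ", snippet))        # True
```

Note that an attribute value containing a longer run of whitespace between > and < would still be altered by either replacement. To compare speed on your own files, something like `timeit` on both compiled patterns would be a reasonable check.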