Is it possible to develop fast, safe, streaming regex HTML minification?

I know, one should never parse HTML with regex. And parsing is the only way to get really effective HTML minification.

But what if I'm not that worried about perfection? I just want to get a reasonable amount of whitespace out of my HTML.

And instead of applying a regex to a massive file, I want to apply it to a stream of chunks of the file.

My current solution is simply this:

(?<=>)\s+(?=<)

That matches any run of one or more whitespace characters between the > that closes one tag and the < that opens the next, e.g. > <. I replace each match with " " (a single space).
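For concreteness, here is a minimal sketch of that replacement applied to a whole string. Python is an assumption here (the question does not name a language), and the helper name `collapse_between_tags` is made up for illustration.

```python
import re

# Pattern from the question: whitespace that sits between a '>' and a '<'.
BETWEEN_TAGS = re.compile(r"(?<=>)\s+(?=<)")

def collapse_between_tags(html: str) -> str:
    """Replace each whitespace run between '>' and '<' with one space."""
    return BETWEEN_TAGS.sub(" ", html)

sample = "<ul>\n    <li>one</li>\n    <li>two</li>\n</ul>"
print(collapse_between_tags(sample))  # <ul> <li>one</li> <li>two</li> </ul>
```

Whitespace inside text content (e.g. the two spaces in `<p>a  b</p>`) is left alone, because it is not directly bounded by > and <.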

My questions are:

1. Is this safe? I.e., is there anything in typical HTML that this might break?
2. Can I get better performance (speed and/or more matches) without sacrificing safety?

(P.S.: I've applied this to a file that was ~500 KB. It went down to 350 KB. Using an external minifier took it to 340 KB. I'm pretty happy with the 150 KB savings and not too worried about the extra 10 KB.)

Solution

It depends. Consider this HTML snippet:

<div> 
    <p>Some paragraph here</p>
    <div data-rel="some data > < here"> 
        <p>some subparagraph here</p>
    </div>
</div>

Here your expression also matches the > < inside the data-rel attribute value, which may hold important data (see a demo here). This might or might not break your code (ad 1.).
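A small check of that claim, sketched in Python (an assumption, since the question names no language). It only uses the opening tag from the snippet above.

```python
import re

BETWEEN_TAGS = re.compile(r"(?<=>)\s+(?=<)")

tag = '<div data-rel="some data > < here">'

# The pattern has no notion of "inside an attribute", so it matches here too.
match = BETWEEN_TAGS.search(tag)
print(match is not None)  # True

# Replacing with a single space happens to leave this particular value intact,
# but removing the whitespace entirely alters the attribute value.
kept = BETWEEN_TAGS.sub(" ", tag)
removed = BETWEEN_TAGS.sub("", tag)
print(kept == tag)  # True
print(removed)      # <div data-rel="some data >< here">
```

So with a single-space replacement, the attribute only changes when the matched run is something other than exactly one space (several spaces, a tab, a newline).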

Concerning your second question (ad 2.), consuming the > and < as part of the match is usually faster than asserting them with lookarounds, so you could just as well write:

>\s+<

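And replace this with the following (or with > < if you want to keep the single space from your original replacement, since removing the whitespace between inline elements entirely can change how the page renders):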

><

See the reduction in steps compared to your first expression here (259 steps for the lookaround version vs 28 for this one, a reduction of ~90 percent).
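Since the question is about chunks rather than whole files, one thing to watch is a whitespace run that is split across two chunks: neither chunk on its own contains the full > ... < sequence. The following is only a sketch of one way to handle that, assuming Python and text (not bytes) chunks. The names `minify_stream`, `GAP` and `OPEN_TAIL` are made up for illustration, and the default replacement keeps a single space as in the question.

```python
import re
from typing import Iterable, Iterator

GAP = re.compile(r">\s+<")         # consuming pattern from above
OPEN_TAIL = re.compile(r">\s*\Z")  # a '>' followed only by whitespace up to the end

def minify_stream(chunks: Iterable[str], replacement: str = "> <") -> Iterator[str]:
    """Collapse inter-tag whitespace, including runs split across chunks."""
    carry = ""
    for chunk in chunks:
        buf = carry + chunk
        tail = OPEN_TAIL.search(buf)
        # Hold back a trailing '>' plus whitespace: the '<' may be in the next chunk.
        cut = tail.start() if tail else len(buf)
        carry = buf[cut:]
        out = GAP.sub(replacement, buf[:cut])
        if out:
            yield out
    if carry:
        yield carry  # end of input: no '<' followed, so nothing to collapse

parts = ["<div>  \n", "   <p>hi</p> ", "\n</div>"]
print("".join(minify_stream(parts)))  # <div> <p>hi</p> </div>
```

The output is the same as running the substitution over the joined input, regardless of where the chunk boundaries fall. Replacing `>\s+<` with `> <` also gives the same result as the lookaround version with a single space, because the > and < are simply written back. The attribute caveat above still applies.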