php regex preg-replace domdocument appendchild

preg_replace vs DOMDocument replaceChild

I was wondering which of the two methods in the title is more efficient for replacing content in an HTML page.

I have this custom tag in my page: <includes module='footer'/> which will be replaced with some content.

There are some downsides to using DOMDocument->getElementsByTagName('includes')->item(0)->parentNode->replaceChild. For instance, when I forget the slash in the tag, as in <includes module='footer'>, the whole site crashes.

Regex tolerates exceptions like these, as long as the input matches the pattern. It would even allow me to replace any string, like {includes:footer}.

Now back to my actual question: are there any downsides to using regex for this purpose, such as performance issues?

More here: Append child/element in head using XML Manipulation

cheers

Solution

I wouldn't be too worried about performance here; I would consider the two approaches "comparable". Benchmarks would need to be run to truly determine this, as the result depends on the size of the document and how the regular expression is written.
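A rough way to measure this yourself is sketched below. The test document, the averageMs() helper, and the run count are all made up for illustration, and the numbers will vary with PHP version, document size, and pattern, so treat it as a starting point rather than a verdict.

```php
<?php
// Hypothetical test document: 1,000 paragraphs followed by the custom tag.
$xml = '<page>' . str_repeat('<p>Lorem ipsum</p>', 1000)
     . "<includes module='footer'/></page>";

// Hypothetical helper: runs $fn $runs times and returns the
// average wall-clock time per run in milliseconds.
function averageMs(callable $fn, int $runs = 50): float
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / $runs / 1e6;
}

// Regex approach: one pass over the raw string.
$regexMs = averageMs(function () use ($xml) {
    return preg_replace('~<includes\s+module=\'footer\'\s*/?>~', 'Footer text', $xml);
});

// DOM approach: parse, replace the node, serialize again.
$domMs = averageMs(function () use ($xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $node = $doc->getElementsByTagName('includes')->item(0);
    $node->parentNode->replaceChild($doc->createTextNode('Footer text'), $node);
    return $doc->saveXML($doc->documentElement);
});

printf("regex: %.3f ms, DOM: %.3f ms\n", $regexMs, $domMs);
```

Note that the DOM timing includes parsing and serializing on every run, which is the fair comparison when only a single replacement is made per document.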

Instead, I would be concerned about accuracy. In general, DOMDocument will be much better at parsing XML, since it was built to read and understand the language. However, it does fail on <includes module='footer'> because that is an unclosed tag (the parser expects a matching </includes>).
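To make the failure mode concrete, here is a minimal sketch that assumes the page is loaded as XML with loadXML(). The replaceIncludes() helper and the sample markup are hypothetical. It returns null for malformed input instead of letting the error take the site down.

```php
<?php
// Hypothetical helper: swaps the first <includes/> element for a text node.
// Returns null when the input is not well-formed XML.
function replaceIncludes(string $xml, string $content): ?string
{
    // Collect libxml parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null; // e.g. an unclosed <includes module='footer'>
    }

    $node = $doc->getElementsByTagName('includes')->item(0);
    if ($node !== null) {
        $node->parentNode->replaceChild($doc->createTextNode($content), $node);
    }

    // Serialize the root element only, without the XML declaration.
    return $doc->saveXML($doc->documentElement);
}

$ok  = replaceIncludes("<div><includes module='footer'/></div>", 'Footer text');
$bad = replaceIncludes("<div><includes module='footer'></div>", 'Footer text');

var_dump($ok);  // string: <div>Footer text</div>
var_dump($bad); // NULL, because the unclosed tag makes the document malformed
```

Checking the return value of loadXML() and the result of item(0) before calling replaceChild is what keeps a single typo in the markup from becoming a fatal error.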

Most common HTML/XML formatting issues can be fixed with PHP's Tidy class. I would check it out, since you should get far more predictable results than if you parsed with regex. With a regular expression, there could technically be attributes before or after module, elements nested within the includes element, unexpected characters as in <includes module='foo>bar'>, and so on.

In the end, if your XML is in a "controlled" environment (i.e. you know what can and can't happen, you know which characters module can contain, you know that it will always be a self-closing element with no children, etc.), then by all means use a regular expression. Just know that it matches only a very specific set of rules. However, if you expect this to work with "anything you throw at it", please use a DOM parser (after Tidy'ing to avoid the exceptions), regardless of performance (although I bet it will be very comparable in many instances).
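For the controlled case, a regex version might look like the sketch below. The $modules table, the renderIncludes() helper, and the allowed character set for module names are assumptions made for illustration. The pattern accepts the tag with or without the closing slash and nothing else.

```php
<?php
// Hypothetical lookup table of module name => replacement markup.
$modules = ['footer' => '<footer>Site footer</footer>'];

// Matches <includes module='name'> with single or double quotes and an
// optional "/" before ">". Module names are assumed to consist only of
// letters, digits, "_" and "-".
$pattern = '~<includes\s+module=(["\'])([A-Za-z0-9_-]+)\1\s*/?>~';

function renderIncludes(string $html, array $modules, string $pattern): string
{
    return preg_replace_callback(
        $pattern,
        // Unknown modules are left untouched rather than silently removed.
        fn(array $m): string => $modules[$m[2]] ?? $m[0],
        $html
    );
}

echo renderIncludes("<p>Body</p><includes module='footer'/>", $modules, $pattern), "\n";
echo renderIncludes("<p>Body</p><includes module='footer'>", $modules, $pattern), "\n";
```

Both calls print <p>Body</p><footer>Site footer</footer>. Anything outside the pattern, such as an extra attribute or a child element, is simply not matched, which is exactly the "very specific set of rules" trade-off.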

Also, as a final note: if you plan to find, replace, or manipulate many nodes in a document, you will see a large performance increase by going with a DOM parser. A DOM parser takes a document and parses it once. After that, you just traverse the data it has already loaded into its class. With regular expressions, by contrast, each individual expression is run across the whole document looking for its own set of matches.
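A sketch of the parse-once approach with several tags follows. The $modules content and the sample document are hypothetical. One detail worth knowing is that getElementsByTagName() returns a live list, so the sketch copies it to an array before replacing anything.

```php
<?php
// Hypothetical content for each module.
$modules = ['header' => 'Header text', 'footer' => 'Footer text'];

$doc = new DOMDocument();
$doc->loadXML(
    "<page><includes module='header'/><p>Body</p><includes module='footer'/></page>"
);

// Copy the live node list to an array first; replacing nodes while
// iterating the live list directly can skip elements.
$nodes = iterator_to_array($doc->getElementsByTagName('includes'));

foreach ($nodes as $node) {
    $name = $node->getAttribute('module');
    if (isset($modules[$name])) {
        $node->parentNode->replaceChild($doc->createTextNode($modules[$name]), $node);
    }
}

$result = $doc->saveXML($doc->documentElement);
echo $result, "\n"; // <page>Header text<p>Body</p>Footer text</page>
```

The document is parsed a single time no matter how many includes elements it contains, whereas separate regular expressions would each scan the full string.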

If you want me to get more specific in any area (i.e. give a Tidy example, or work on a benchmark), let me know.