I have to parse HTML and "HTML" from emails. I've already managed to create a function that cleans most of the errors such as improper nesting of elements.
I'm trying to determine how best to tackle the issue of HTML attributes that are missing values. We must parse everything ultimately as XML so well-formed HTML is a must as well.
The cleaning function starts off simple enough:
$xml = explode('<', $xml);
We quickly determine opening and closing tags of elements.
However once we get to attributes things get really messy really quickly:
Here is an example of an HTML string we have to parse (a p
element):
$s = 'p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text';
We do not care what those attributes are; our goal is simply to fix the XML so that it is well-formed as demonstrated by the following string:
$s = 'p obnoxious="true" nonprofessional="true" style="wrong: lulz-immature" dunno="true">Some paragraph text';
We're not interested in attribute="attribute"
as that is just extra work (most email is frivolous) so we're simply interested in appending ="true"
for each attribute missing a value just to prevent the XML parser on client browsers from failing over the trivialities of someone somewhere else not doing their job.
As I mentioned earlier we only need to fix the attributes which are missing values and we need to return a string. At this point all other issues of malformed XML have been addressed. I'm not sure where I should start as the topic is such a mess. So...
<?xml version="1.0" encoding="UTF-8"?>
).How do we with reasonable effort ensure that an XML string (in part or whole) is returned as a string with values for all attributes?
The DOM extension may solve your problem:
$doc = new DOMDocument('1.0');
$doc->loadHTML('<p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text');
echo $doc->saveXML();
The above code will result in the following output:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p obnoxious="" nonprofessional="" style="wrong: lulz-immature" dunno="">Some paragraph text</p></body></html>
You may replace every =""
with ="true"
if you want, but the output is already a valid XML.