php xml xml-parsing html-parsing

PHP: return an XML string with values added to attributes that are missing them


I have to parse HTML and "HTML" from emails. I've already managed to create a function that cleans most of the errors such as improper nesting of elements.

I'm trying to determine how best to tackle the issue of HTML attributes that are missing values. We must parse everything ultimately as XML so well-formed HTML is a must as well.

The cleaning function starts off simple enough:

$xml = explode('<', $xml);

We quickly determine opening and closing tags of elements.
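
For illustration, the split-and-classify step described above might look roughly like the sketch below. The helper name classifyFragments() and the 'open' / 'close' / 'other' labels are assumptions made for this example, not the actual cleaning function.

```php
<?php
// Hypothetical helper: splits markup on "<" and labels each fragment.
function classifyFragments(string $xml): array
{
    $fragments = explode('<', $xml);
    array_shift($fragments); // drop whatever precedes the first "<"

    $result = [];
    foreach ($fragments as $fragment) {
        $first = $fragment[0] ?? ''; // '' when the fragment is empty
        if ($first === '/') {
            $type = 'close';         // e.g. "/p>"
        } elseif ($first === '!' || $first === '?') {
            $type = 'other';         // comments, doctype, XML declaration
        } else {
            $type = 'open';          // e.g. "p class=x>text"
        }
        $result[] = ['type' => $type, 'fragment' => $fragment];
    }
    return $result;
}
```

Each fragment still carries the text that follows its tag (everything after the first ">"), so later steps have to separate the two.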

However once we get to attributes things get really messy really quickly:

  • Missing values.
  • People using single quotes instead of double quotes.
  • Attribute values may contain single quotes.

Here is an example of an HTML string we have to parse (a p element):

$s = 'p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text';

We do not care what those attributes are; our goal is simply to fix the XML so that it is well-formed as demonstrated by the following string:

$s = 'p obnoxious="true" nonprofessional="true" style="wrong: lulz-immature" dunno="true">Some paragraph text';
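
The transformation between those two strings can be sketched by hand with a regular expression. This is only a rough sketch under stated assumptions: addMissingAttributeValues() is a hypothetical name, the fragment is one piece produced by explode('<', ...), and no attribute value contains a ">" character.

```php
<?php
// Hypothetical helper: rewrites the attributes of one tag fragment so that
// every attribute has a double-quoted value.
function addMissingAttributeValues(string $fragment, string $default = 'true'): string
{
    $end = strpos($fragment, '>');
    if ($end === false) {
        return $fragment; // no tag end found; leave untouched
    }
    $head = rtrim(substr($fragment, 0, $end)); // tag name and attributes
    $tail = substr($fragment, $end);           // ">" plus the following text

    // Remember a self-closing slash so that "br /" stays self-closing.
    $selfClose = '';
    if (substr($head, -1) === '/') {
        $selfClose = '/';
        $head = rtrim(substr($head, 0, -1));
    }

    if (!preg_match('/^(\S+)\s*(.*)$/s', $head, $parts)) {
        return $fragment;
    }
    [, $tag, $rawAttributes] = $parts;

    // name, then optionally = "value" | 'value' | unquoted-value
    $pattern = '/([^\s=\/"\']+)(?:\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\']+)))?/';
    preg_match_all($pattern, $rawAttributes, $matches, PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL);

    $attributes = [];
    foreach ($matches as $m) {
        // Groups that did not take part in the match are null.
        $value = $m[2] ?? $m[3] ?? $m[4] ?? $default;
        // Escape double quotes and bare ampersands without double-encoding entities.
        $value = htmlspecialchars($value, ENT_COMPAT | ENT_XML1, 'UTF-8', false);
        $attributes[] = $m[1] . '="' . $value . '"';
    }

    return $tag . ($attributes ? ' ' . implode(' ', $attributes) : '') . $selfClose . $tail;
}
```

An attribute that is explicitly empty (for example alt="") keeps its empty value; only attributes with no value at all receive the default.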

We're not interested in the attribute="attribute" form, as that is just extra work (most email is frivolous). We simply want to append ="true" to each attribute that is missing a value, so that the XML parser in client browsers does not fail over the trivialities of someone somewhere else not doing their job.

As I mentioned earlier we only need to fix the attributes which are missing values and we need to return a string. At this point all other issues of malformed XML have been addressed. I'm not sure where I should start as the topic is such a mess. So...

  • We're open to sending the entire XML string as a whole to be parsed and returned back as a string by some built-in library. If you go with this option, presume that the XML is otherwise well-formed, with a proper XML declaration (<?xml version="1.0" encoding="UTF-8"?>).
  • We're open to manually creating a function to address whatever we encounter though we're not interested in building a validator as much of the "HTML" we receive screams 1997.
  • We are working with the XML as a single string or an array (your pick); we are explicitly not dealing with files.

How do we, with reasonable effort, ensure that an XML string (in part or whole) is returned as a string with values for all attributes?


Solution

  • The DOM extension may solve your problem:

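    // Parse the sloppy markup with the lenient HTML parser...
    $doc = new DOMDocument('1.0');
    $doc->loadHTML('<p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text');
    
    // ...and serialize the resulting tree as XML.
    echo $doc->saveXML();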
    

    The above code will result in the following output:

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><p obnoxious="" nonprofessional="" style="wrong: lulz-immature" dunno="">Some paragraph text</p></body></html>
    

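    You may replace every ="" with ="true" if you want, but the output is already well-formed XML. Note that loadHTML() wraps the fragment in html and body elements, and that a blanket string replacement would also rewrite attributes that were deliberately left empty (for example alt="").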
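
If the ="true" values are wanted, one option is to set them on the DOM itself instead of editing the serialized string. The following is a minimal sketch, assuming the DOM extension is available; fillEmptyAttributes() is a hypothetical helper name, not part of the extension.

```php
<?php
// Hypothetical helper: parses HTML leniently, gives every empty attribute
// a default value, and returns the tree serialized as XML.
function fillEmptyAttributes(string $html, string $default = 'true'): string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Select every attribute whose value is the empty string. This cannot
    // tell a valueless attribute from one written as attr="".
    foreach ($xpath->query('//@*[. = ""]') as $attr) {
        $attr->value = $default;
    }

    // Serialize only the root element, without XML declaration or doctype.
    return $doc->saveXML($doc->documentElement);
}

echo fillEmptyAttributes("<p obnoxious nonprofessional style='wrong: lulz-immature' dunno>Some paragraph text");
```

Passing $doc->documentElement to saveXML() leaves out the XML declaration and the doctype, which may be convenient when the result is embedded in a larger document.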