
Parsing irregular (malformed) HTML with DOM in PHP


DOM parsing in PHP seems to work only if the HTML is perfectly tagged. I need to parse HTML that is not well-formed, and it comes from a remote server, so I can't change it.

<html>
 <body>
  <table>
   <tr>
    <td>
    1
    </td>
    <td>
    2
    </td></td>
   </tr>
</table>

When I parse HTML with this structure, it gives a warning:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Unexpected end tag : td in Entity, line: 173 in C:\wamp\wwwxxxxxx on line 51


Solution

  • Tools such as Tidy should be able to repair the HTML so that you can load it into DOM.

    $html = "<html>
     <body>
      <table>
       <tr>
        <td>
        1
        </td>
        <td>
        2
        </td></td>
       </tr>
    </table>";
    
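    // Parse the broken markup with Tidy.
    $tidy = tidy_parse_string($html);

    // html() returns the <html> node; its value property holds the markup as Tidy sees it.
    $root = $tidy->html();
    $cleanHTML = $root->value;

    // Load the cleaned markup into DOM.
    $doc = new DOMDocument();
    $doc->loadHTML($cleanHTML);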
    

    Note: The Tidy extension is not enabled by default in PHP; you have to install or enable it before you can use these functions.
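
If installing Tidy is not an option, it may be enough to silence the parser instead of repairing the input. The message in the question is a warning, not a fatal error, and DOMDocument's HTML parser usually recovers from a stray end tag on its own. The following is a minimal sketch under that assumption, not a drop-in replacement for the Tidy approach. The variable names ($previous, $errors, $cells) are made up for illustration; the libxml_* functions are part of PHP's libxml extension.

```php
<?php
// Same malformed markup as in the question: note the stray </td>.
$html = "<html>
 <body>
  <table>
   <tr>
    <td>
    1
    </td>
    <td>
    2
    </td></td>
   </tr>
</table>";

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Optionally inspect what the parser complained about, then restore the old state.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Hypothetical follow-up: gather the trimmed text of every <td> cell.
$cells = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = trim($td->textContent);
}
```

Whether this is sufficient depends on how badly the remote HTML is broken. For a single extra closing tag the parser simply ignores it, but for heavily damaged markup a repair step such as Tidy is likely the safer choice.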