
Extracting certain portions of HTML from within PHP


OK, so I'm writing a PHP application that checks whether all the links on my sites are valid, so I can update them if I have to.

I ran into a problem. I've tried using SimpleXML and DOMDocument objects to extract the tags, but when I run the app against a sample site I usually get a ton of errors with SimpleXML.

So is there a way to scan the HTML document for href attributes that's about as simple as using SimpleXML?

    <?php
    // what I want to do is get a similar effect to the code described below:

    foreach ($html->html->body->a as $link)
    {
        // store the $link into a file
        foreach ($link->attributes() as $attribute => $value)
        {
            // procedure to place the href value into a file
        }
    }
    ?>

So basically I'm looking for a way to perform the above operation. The thing is, I'm confused about how I should treat the string that contains the HTML code...

Just to be clear, I'm using the following primitive way of getting the HTML file:

    <?php
    $target      = "http://www.targeturl.com";

    $file_handle = fopen($target, "r");

    $a = "";

    while (!feof($file_handle)) $a .= fgets($file_handle, 4096);

    fclose($file_handle);
    ?>

Any info would be useful, as would alternatives in other languages where the above problem is solved more elegantly (Python, C or C++).


Solution

  • You can use DOMDocument::loadHTML

    Here's a bunch of code we use for an HTML parsing tool we wrote.

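    $target = "http://www.targeturl.com";
    $result = file_get_contents($target);
    $dom = new DOMDocument;
    $dom->preserveWhiteSpace = false;
    @$dom->loadHTML($result); // @ hides the warnings libxml raises on malformed markup

    // getTags() returns one HTML fragment per <a> element;
    // run each fragment through extractLink() to get its href.
    $links = array();
    foreach (getTags($dom, 'a') as $fragment) {
      $href = extractLink($fragment);
      if (is_string($href)) { // skip anchors that have no href
        $links[] = $href;
      }
    }

    // Returns the first href found in $html ($argument = 1)
    // or the first link text ($argument = 2).
    function extractLink( $html, $argument = 1 ) {
      $href_regex_pattern = '/<a[^>]*?href=[\'"](.*?)[\'"][^>]*?>(.*?)<\/a>/si';

      preg_match_all($href_regex_pattern,$html,$matches);

      if (count($matches)) {

        if (is_array($matches[$argument]) && count($matches[$argument])) {
          return $matches[$argument][0];
        }

        return $matches[1];
      } else {
        // Assumed fallback: the original snippet is cut off at this branch.
        return false;
      }
    }

    // Returns the HTML of every element matching $tagName as an array of strings.
    // $element picks a single fragment by index; $children appends an XPath step.
    function getTags( $dom, $tagName, $element = false, $children = false ) {
        $html = array(); // must be an array: fragments are appended below
        $domxpath = new DOMXPath($dom);

        $children = ($children) ? "/".$children : '';
        $filtered = $domxpath->query("//$tagName" . $children);

        $i = 0;
        while( $myItem = $filtered->item($i++) ){
            $newDom = new DOMDocument;
            $newDom->formatOutput = true;

            $node = $newDom->importNode( $myItem, true );

            $newDom->appendChild($node);
            $html[] = $newDom->saveHTML();
        }

        if ($element !== false && isset($html[$element])) {
          return $html[$element];
        } else {
          return $html;
        }
    }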
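
If all you need is the href values, you may not need the regex step at all. Here is a minimal sketch that reads the attribute straight off the DOM nodes. The function name `extractHrefs` and the sample markup are made up for illustration and are not part of the tool above. It uses `libxml_use_internal_errors()` instead of `@` to keep parser warnings from being printed.

```php
<?php
// Hypothetical helper (not from the original answer): returns the href value
// of every <a> element in an HTML string, in document order.
function extractHrefs(string $html): array
{
    // loadHTML() rejects an empty string, so bail out early.
    if ($html === '') {
        return [];
    }

    // Collect libxml warnings internally instead of printing them;
    // real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // Anchors such as <a name="top"> carry no href; skip them.
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }

    return $hrefs;
}

// Sample markup, invented for this example.
$sample = '<html><body><p><a href="/one">One</a> <a name="x">no href</a>'
        . '<a href="http://example.com/two">Two</a></p></body></html>';

print_r(extractHrefs($sample));
```

To use it on a live page, pass in the string you already fetched, e.g. `extractHrefs(file_get_contents($target))`. The hrefs come back exactly as written in the page, so relative ones still have to be resolved against the page URL before you check them.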