
Using DOMXPath to find data in not-so-nice HTML


I am trying to get some data from a plant list site. This proves to be a bit problematic because their HTML isn't really well-formed. These are two rows from the search result (disclaimer: I am not responsible for this code):

<tr>
    <td>
        <i class="glyphicons-icon leaf"></i>
    </td>
    <td>
        <a title="Cimicifuga simplex" href="/taxon/wfo-0000604773" class="result">
            <h4 class="h4Results"><em>Cimicifuga simplex</em>(DC.) Wormsk. ex Turcz.</h4>
        </a>    
        Bull. Soc. Imp. Naturalistes Moscou<br/>
        <div>
            <em>Status:</em><span id="entryStatus">Synonym of&#160;</span>
            <a href="/taxon/wfo-0000519124"><em>Actaea simplex</em>(DC.) Wormsk. ex Prantl</a>
        </div>
        <div>
            <em>Rank:</em><span id="entryRank">Species</span>
        </div>
        <div>
            <em>Family:</em> Ranunculaceae
        </div>
    </td>
    <td>
        <img title="No Image Available" src="/css/images/no_image.jpg" class="thumbnail pull-right"/>
    </td>
</tr>
<tr>
    <td>
        <i class="glyphicons-icon leaf"></i>
    </td>
    <td>
        <a title="Actaea simplex" href="/taxon/wfo-0000519124" class="result">
            <h4 class="h4Results"><strong><em>Actaea simplex</em>(DC.) Wormsk. ex Prantl</strong></h4>
        </a>
        Bot. Jahrb. Syst.<br/>
        <div>
            <em>Status:</em><span id="entryStatus">Accepted Name</span>
        </div>
        <div>
            <em>Rank:</em><span id="entryRank">Species</span>
        </div>
        <div>
            <em>Family:</em> Ranunculaceae</div>
        <div>
            <em>Order:</em> Ranunculales
        </div>
    </td>
    <td>
        <img title="No Image Available" src="/css/images/no_image.jpg" class="thumbnail pull-right"/>
    </td>
</tr>

I added the indentation myself; otherwise it wasn't readable.

Anyway, I loaded the page in PHP with DOMXPath and now I want to get two things:

  • Select the row that has Accepted Name in it
  • Get the species name and the corresponding link from it

In this case the result would be "Actaea simplex" and "/taxon/wfo-0000519124". Note that there will be more results resembling the first row, and that the row I am looking for is not necessarily the second one.

Normally I just try, use Google and try some more, and in the end I get there, but in this case IDs are used as classes and are not unique. This makes it impossible to use an XPath tester, and perhaps even useless for DOMXPath.

So, is it possible to get my data with DOMXPath, and if yes, what query do I use?


Solution

  • Try something like:

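    // $html is assumed to hold the downloaded page source
    libxml_use_internal_errors(true); // collect warnings about malformed markup instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html); // loadHTML() tolerates markup that loadXML() would reject
    $xpath = new DOMXPath($dom);

    // Find the cell whose status span reads "Accepted Name", then take its direct <a> child
    $target = $xpath->query("//td[.//span[.='Accepted Name']]/a");
    $first = $target->item(0);
    if ($first !== null) {
        $link = $first->getAttribute('href');
        $title = $first->getAttribute('title');
        echo $title, " ", $link;
    }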
    

    Output

    Actaea simplex /taxon/wfo-0000519124
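
On the worry about repeated IDs: XPath compares `@id` like any other attribute, so duplicates should not stop the query from working, although libxml may emit "ID already defined" warnings while parsing. Below is a minimal, self-contained sketch of the same idea that loops over every matching row instead of taking only the first. The `$html` sample is a trimmed copy of the rows from the question, and the `$query` and `$accepted` names, as well as the choice to match on `@class='result'`, are my own assumptions rather than anything the site guarantees.

```php
<?php
// Hypothetical sample input: a trimmed copy of the two rows from the question.
$html = <<<'HTML'
<table>
<tr>
    <td>
        <a title="Cimicifuga simplex" href="/taxon/wfo-0000604773" class="result">
            <h4 class="h4Results"><em>Cimicifuga simplex</em>(DC.) Wormsk. ex Turcz.</h4>
        </a>
        <div>
            <em>Status:</em><span id="entryStatus">Synonym of&#160;</span>
            <a href="/taxon/wfo-0000519124"><em>Actaea simplex</em>(DC.) Wormsk. ex Prantl</a>
        </div>
    </td>
</tr>
<tr>
    <td>
        <a title="Actaea simplex" href="/taxon/wfo-0000519124" class="result">
            <h4 class="h4Results"><strong><em>Actaea simplex</em>(DC.) Wormsk. ex Prantl</strong></h4>
        </a>
        <div>
            <em>Status:</em><span id="entryStatus">Accepted Name</span>
        </div>
    </td>
</tr>
</table>
HTML;

// Collect parser warnings (e.g. about the duplicate ids) instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Match the status span by its (repeated) id attribute and its trimmed text,
// climb to the enclosing row, then descend to that row's result link.
$query = "//span[@id='entryStatus'][normalize-space(.)='Accepted Name']"
       . "/ancestor::tr[1]//a[@class='result']";

$accepted = [];
foreach ($xpath->query($query) as $a) {
    $accepted[] = [
        'name' => $a->getAttribute('title'),
        'link' => $a->getAttribute('href'),
    ];
}

foreach ($accepted as $row) {
    echo $row['name'], ' ', $row['link'], PHP_EOL;
}
```

Starting from the span and walking up with `ancestor::tr[1]` keeps the query independent of where the row sits in the table, which matters because the accepted name is not always the second result.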