Selecting a deeply nested link using xpath query


<body class="en-us">   <div id="wrapper">
    <div id="content">
      <div class="content-top">
        <div class="content-bot">
          <div id="profile-wrapper" class=
          "profile-wrapper profile-wrapper-horde">
            <div class="profile-sidebar-anchor">
              <div class="profile-sidebar-outer">
                <div class="profile-sidebar-inner">
                  <div class="profile-sidebar-contents">
                    <div class="profile-sidebar-crest">
                      <a href="/wow/en/character/some-server/sometoon/" rel="np" class="profile-sidebar-character-model" style="">
                      </a>

                      <div class="profile-sidebar-info">
                        <div class="name">
                          <a href="/wow/en/character/some-server/sometoon/"
                          rel="np">Glitchshot</a>
                        </div>

                        <div class="under-name color-c8">
                          <span class="level"><strong>85</strong></span>
                          <a href="/wow/en/game/race/somerace" class="race">somerace</a> 
                          <a href="/wow/en/game/class/someclass" class="class">someclass</a>
                        </div>

                        <div class="guild">
                          <a href="/wow/en/guild/some-server/someguild/?character=sometoon">
                          Some Guild</a>
                        </div>

                        <div class="realm">
                          <span id="profile-info-realm" class="tip"
                          data-battlegroup="Stormstrike">Black
                          Dragonflight</span>
                        </div>
                      </div>
                    </div>

                    <ul class="profile-sidebar-menu" id="profile-sidebar-menu">
                      <li><a href=
                      "/wow/en/character/some-server/sometoon/" class=
                      "back-to" rel="np"><span class="arrow"><span class=
                      "icon">Character Summary</span></span></a></li>

                      <li class="root-menu"><a href=
                      "/wow/en/character/some-server/sometoon/achievement"
                         class="back-to" rel="np"><span class=
                         "arrow"><span class=
                         "icon">Achievements</span></span></a></li>

                      <li class=" active"><a href=
                      "/wow/en/character/some-server/sometoon/achievement#summary"
                         class="" rel="np"><span class="arrow"><span class=
                         "icon">Achievements</span></span></a></li>

                      <li class=""><a href=
                      "/wow/en/character/some-server/sometoon/achievement#92"
                         class="" rel="np"><span class="arrow"><span class=
                         "icon">General</span></span></a></li>

I know I have posted a lot of markup here, but I wanted you to have an idea of what the DOM looks like.

From this:

<a href="/wow/en/character/some-server/sometoon/achievement#92" class="" rel="np"><span class="arrow"><span class="icon">General</span></span></a>

I would like to extract this:

/wow/en/character/some-server/sometoon/achievement#92

which comes from the last anchor in the posted markup.

I have read as much as I can find on how to use an XPath query to extract the needed information, but I am clearly missing something. Below is the query that I thought should work but does not.

<?php
    $query = '*/ul[@class=profile-sidebar-menu]/ul/li[3]/ul/li[1]/a/@href';
    echo $query . "<br>";
    $achievementSubCategory = $xpath->query($query);

    $achiSubArray = array("URL" => $achievementSubCategory->item(0)->nodeValue);
    var_dump($achiSubArray);
    // Produces array(1) { ["URL"]=> NULL } which should look something more like:
    // array(1) { ["URL"]=> /wow/en/character/some-server/sometoon/achievement#92 }
?>

Thank you in advance for your assistance and advice


Solution

  • */ul[@class=profile-sidebar-menu]/ul/li[3]/ul/li[1]/a/@href
    

    There are a few problems with this XPath expression:

    1. It looks for a ul element that is a grandchild of the current node and whose class attribute has a string value equal to the string value of one of the ul's child elements named profile-sidebar-menu. Because profile-sidebar-menu is not quoted, XPath reads it as an element name rather than as a string literal. The ul has no child named profile-sidebar-menu, so the whole expression selects no node.

    2. Another problem is the indexing. li[3] selects the third li child of the context node. However, the wanted a element is a child of the fourth li child, which must be expressed as li[4]. XPath positions are 1-based, not 0-based.
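
The two points above can be checked with a short PHP sketch. The markup is a cut-down, hypothetical stand-in for the posted menu (the shortened href values are assumptions made for brevity), and the queries use // so they do not depend on a particular context node:

```php
<?php
// Cut-down stand-in for the posted menu; only the structure matters here.
$html = '<div><ul class="profile-sidebar-menu" id="profile-sidebar-menu">'
      . '<li><a href="/summary">Character Summary</a></li>'
      . '<li class="root-menu"><a href="/achievement">Achievements</a></li>'
      . '<li class=" active"><a href="/achievement#summary">Achievements</a></li>'
      . '<li class=""><a href="/achievement#92">General</a></li>'
      . '</ul></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Unquoted: @class is compared with a child element named
// profile-sidebar-menu, which does not exist, so nothing matches.
$unquoted = $xpath->query('//ul[@class=profile-sidebar-menu]');

// Quoted: @class is compared with a string literal.
$quoted = $xpath->query('//ul[@class="profile-sidebar-menu"]');

// Positions are 1-based: li[3] is the third item, li[4] the fourth.
$third  = $xpath->query('//ul[@class="profile-sidebar-menu"]/li[3]/a/@href');
$fourth = $xpath->query('//ul[@class="profile-sidebar-menu"]/li[4]/a/@href');

echo $unquoted->length, "\n";            // 0
echo $quoted->length, "\n";              // 1
echo $third->item(0)->nodeValue, "\n";   // /achievement#summary
echo $fourth->item(0)->nodeValue, "\n";  // /achievement#92
```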

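    If these two problems are corrected, and the extra ul and li steps are dropped (in the posted markup the li elements are direct children of the ul, and the a is a direct child of the li), I believe that the corrected expression should look like the following:

    */ul[@class="profile-sidebar-menu"]/li[4]/a/@href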
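
For reference, here is a minimal sketch of running a corrected query from PHP. Two things in it are assumptions rather than part of the question: the $html string is a shortened stand-in for the downloaded page, and the query starts with // instead of */ because the context node used in the original script is not shown. It also steps straight from the ul to its li children, since the posted ul contains no nested ul.

```php
<?php
// Hypothetical input: in practice $html would hold the downloaded profile page.
$html = '<div id="wrapper"><ul class="profile-sidebar-menu" id="profile-sidebar-menu">'
      . '<li><a href="/wow/en/character/some-server/sometoon/">Character Summary</a></li>'
      . '<li><a href="/wow/en/character/some-server/sometoon/achievement">Achievements</a></li>'
      . '<li><a href="/wow/en/character/some-server/sometoon/achievement#summary">Achievements</a></li>'
      . '<li><a href="/wow/en/character/some-server/sometoon/achievement#92">General</a></li>'
      . '</ul></div>';

// Real-world HTML is rarely clean, so collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$query = '//ul[@class="profile-sidebar-menu"]/li[4]/a/@href';
$nodes = $xpath->query($query);

// query() returns false for an invalid expression, and item(0) is null
// when nothing matched, so guard before reading nodeValue.
$href = ($nodes !== false && $nodes->length > 0)
    ? $nodes->item(0)->nodeValue
    : null;

$achiSubArray = array("URL" => $href);
var_dump($achiSubArray);
```

If the wanted link is always the last menu entry, li[last()] may be a more robust step than li[4], though that depends on how the real page is structured.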
    

    The absolute XPath expression that selects the wanted href attribute, starting from the top element (body) of the provided XML document, is:

    /*/*/*/*/*/*/*/*/*/*/ul/li[4]/a/@href
    

    Below is the XML document (the provided one, made well-formed by appending a number of missing end tags):

    <body class="en-us">
        <div id="wrapper">
            <div id="content">
                <div class="content-top">
                    <div class="content-bot">
                        <div id="profile-wrapper" class=
                  "profile-wrapper profile-wrapper-horde">
                            <div class="profile-sidebar-anchor">
                                <div class="profile-sidebar-outer">
                                    <div class="profile-sidebar-inner">
                                        <div class="profile-sidebar-contents">
                                            <div class="profile-sidebar-crest">
                                                <a href="/wow/en/character/some-server/sometoon/" rel="np" class="profile-sidebar-character-model" style=""></a>
                                                <div class="profile-sidebar-info">
                                                    <div class="name">
                                                        <a href="/wow/en/character/some-server/sometoon/"
                                  rel="np">Glitchshot</a>
                                                    </div>
                                                    <div class="under-name color-c8">
                                                        <span class="level">
                                                            <strong>85</strong>
                                                        </span>
                                                        <a href="/wow/en/game/race/somerace" class="race">somerace</a>
                                                        <a href="/wow/en/game/class/someclass" class="class">someclass</a>
                                                    </div>
                                                    <div class="guild">
                                                        <a href="/wow/en/guild/some-server/someguild/?character=sometoon">
                                  Some Guild</a>
                                                    </div>
                                                    <div class="realm">
                                                        <span id="profile-info-realm" class="tip"
                                  data-battlegroup="Stormstrike">Black
                                  Dragonflight</span>
                                                    </div>
                                                </div>
                                            </div>
                                            <ul class="profile-sidebar-menu" id="profile-sidebar-menu">
                                                <li>
                                                    <a href=
                              "/wow/en/character/some-server/sometoon/" class=
                              "back-to" rel="np">
                                                        <span class="arrow">
                                                            <span class=
                              "icon">Character Summary</span></span>
                                                    </a>
                                                </li>
                                                <li class="root-menu">
                                                    <a href=
                              "/wow/en/character/some-server/sometoon/achievement"
                                 class="back-to" rel="np">
                                                        <span class=
                                 "arrow">
                                                            <span class=
                                 "icon">Achievements</span></span>
                                                    </a>
                                                </li>
                                                <li class=" active">
                                                    <a href=
                              "/wow/en/character/some-server/sometoon/achievement#summary"
                                 class="" rel="np">
                                                        <span class="arrow">
                                                            <span class=
                                 "icon">Achievements</span></span>
                                                    </a>
                                                </li>
                                                <li class="">
                                                    <a href=
                              "/wow/en/character/some-server/sometoon/achievement#92"
                                 class="" rel="np">
                                                        <span class="arrow">
                                                            <span class=
                                 "icon">General</span></span>
                                                    </a>
                                                </li>
                                            </ul>
                                        </div>
                                    </div>
                                </div>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </body>
    

    One can check that the above absolute XPath expression selects exactly the wanted href attribute by evaluating it with a tool like the XPath Visualizer.
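
The same check can also be sketched in PHP instead of a visual tool. The skeleton document below is an assumption made for brevity: it keeps only the nesting of the well-formed document (body, nine nested div wrappers, then the ul with four li children) and uses placeholder href values for the first three links.

```php
<?php
// Skeleton with the same nesting depth as the document above:
// body, nine nested div elements, then the ul.
$xml = '<body>' . str_repeat('<div>', 9)
     . '<ul>'
     . '<li><a href="first"/></li>'
     . '<li><a href="second"/></li>'
     . '<li><a href="third"/></li>'
     . '<li><a href="/wow/en/character/some-server/sometoon/achievement#92"/></li>'
     . '</ul>'
     . str_repeat('</div>', 9) . '</body>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

$abs = '/*/*/*/*/*/*/*/*/*/*/ul/li[4]/a/@href';

// evaluate() can return typed results: count() gives a number and
// string() gives the string value of the first selected node.
$count = $xpath->evaluate("count($abs)");
$value = $xpath->evaluate("string($abs)");

var_dump($count == 1);  // true when exactly one attribute is selected
echo $value, "\n";
```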

    [Image: snapshot of the selection, performed with the XPath Visualizer]