Tags: php, html, xhtml, tidy, dom, xpath

XPATH not working on the HTML


I have code that reads an HTML file from my local web server (localhost) and converts it to XHTML with tidy. I then load that XHTML into a DOMDocument. The code looks like this:

<?php


function getXHTML($html)
{
    $options = array("output-html" => true,"quote-nbsp" => true, "drop-proprietary-attributes" => true,"drop-font-tags" => true,"drop-empty-paras" => true,"hide-comments" => true);
    $tidy=new tidy();
    $xhtml=$tidy->repairString($html,$options);
    echo $xhtml;
    return $xhtml;
}
$content = file_get_contents("http://localhost/filename.htm");
$page = new DOMDocument();
$xpath=new DOMXPath($page);
$content = getXHTML($content);   // this is a tidy function to return XHTML
$page->loadHTML($content);   
$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$total = $xpath->query($totalPath);
echo $total->length;    // this shows zero
?> 

The contents of filename.htm look like this:

<!-- saved from url=(0041)http://www.rtu.ac.in/results/reformat.php -->
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="SHORTCUT ICON" href="http://www.rtu.ac.in/favicon.ico">
<link href="./Result - Rajasthan Technical University6_files/styleresults.css" rel="stylesheet" type="text/css">
<title>Result - Rajasthan Technical University</title>
</head>
<body>


<table width="773" cellpadding="5" cellspacing="0" align="center">
  <tbody><tr height="60">
    <td width="16%" height="60" valign="top"><font color="brown" size="+2"><img src="./Result - Rajasthan Technical University6_files/logo.jpg" width="100" height="102" border="0" align="right">&nbsp;</font></td>
    <td width="72%" height="60" align="center" valign="top"><p><font color="brown" size="+2"><strong>RAJASTHAN TECHNICAL UNIVERSITY </strong></font></p><font color="brown" size="+2">
      <p><font size="+1"><strong>B.Tech -IVth SEMESTER -2010(Main) 16.5.2011</strong></font></p><font size="+1">&nbsp;</font></font></td>      
    <td width="12%" height="80"><strong>www.rtu.ac.in</strong>&nbsp;</td>
  </tr>
</tbody></table>



<br>
<br>
<table width="783" align="center" cellpadding="5" cellspacing="0" class="table"> 
  <tbody>
    <tr>
      <td width="34%" align="center" valign="top" rowspan="2"><strong>Subject(s) Name </strong>&nbsp;</td>
      <td width="10%" align="center" valign="top" colspan="1" rowspan="2"> <strong>Subject(s) Code </strong>&nbsp;</td>

      <td align="center" valign="top" colspan="3" rowspan="1"><strong>Marks Obtained </strong>&nbsp;</td>
    </tr>


    <tr>
      <td width="20%" align="center"><strong>Internal</strong>&nbsp;</td>
      <td width="18%" align="center"><strong>Theory</strong>&nbsp;</td>
      <td width="18%" align="center">&nbsp;</td>
    </tr>




        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-1</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4551</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 16</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 50</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      </tr>

        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-2</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;4552</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 17</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 61</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      </tr>

        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-3</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4553</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 19</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 49</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      </tr>
        <tr>
          <td align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-4</strong>&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">4554</td>
          <td align="center" style=" border-bottom: 0px none transparent;"> 14</td>
          <td align="center" style=" border-bottom: 0px none transparent;"> 68</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
        </tr>
        <tr>
          <td align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-5</strong>&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">4555</td>
          <td align="center" style=" border-bottom: 0px none transparent;"> 14</td>
          <td align="center" style=" border-bottom: 0px none transparent;"> 36</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
        </tr>

        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-6</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4556</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 19</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 48</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      </tr><tr>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;<strong>Internal</strong>&nbsp;</td>
          <td width="18%" align="center" style=" border-bottom: 0px none transparent;"><strong>Practical</strong>&nbsp;</td>
        </tr>

        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-1</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4174</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 29</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">48</td>
      </tr>




        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-2</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4175</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 16</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">26</td>
      </tr>

      <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-3</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4171</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 15</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">27</td>
      </tr>
      <tr>
        <td align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-4</strong>&nbsp;</td>
        <td align="center" style=" border-bottom: 0px none transparent;">4172</td>
        <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
        <td align="center" style=" border-bottom: 0px none transparent;"> 17</td>
        <td align="center" style=" border-bottom: 0px none transparent;">29</td>
        </tr>
      <tr>
        <td align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-5</strong>&nbsp;</td>
        <td align="center" style=" border-bottom: 0px none transparent;">4173</td>
        <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
        <td align="center" style=" border-bottom: 0px none transparent;"> 29</td>
        <td align="center" style=" border-bottom: 0px none transparent;">46</td>
        </tr>




        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>Disipline (Deca)</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4176</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">46</td>
      </tr>
  <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr></tbody>
</table>

<br><table width="783" align="center" cellpadding="5" cellspacing="0" class="table">
  <tbody><tr>

    <td width="18%" align="center" valign="top"><strong>Practical Marks   </strong>&nbsp;</td>
    <td width="18%" align="center" valign="top">328</td>
    <td width="19%" align="center" valign="top"><strong>Theory Marks </strong>&nbsp;</td>
    <td width="19%" align="center" valign="top">411</td>
  </tr>

  <tr>
    <td width="18%" align="center"><strong>Institute Code   </strong>&nbsp;</td>
    <td width="18%" align="center"> 1229 </td>
    <td width="19%" align="center"><strong>DECCA </strong>&nbsp;</td>
    <td width="19%" align="center">4176</td>
  </tr>

  <tr>

    <td width="18%" align="center"><strong>Division   </strong>&nbsp;</td>
    <td width="18%" align="center"> PASS </td>
    <td width="19%" align="center"><strong>Grand Total </strong>&nbsp;</td>
    <td width="19%" align="center">739</td>
  </tr>
  </tbody></table>


&nbsp;&nbsp; 
<!-- Reformatter by Shashank Kumar Jain (CS, IIIrd Year, 2010-11) -->


<div id="csscan-wrapper" style="display: none; "><h2 id="csscan-header">element</h2><table id="csscan-table"><tbody><tr><th colspan="2" id="csscan-header-font" class="csscan-header">Font</th></tr><tr id="csscan-row-font-family"><td id="csscan-property-font-family" class="csscan-property">font-family</td><td id="csscan-value-font-family" class="csscan-value"></td></tr><tr id="csscan-row-font-size"><td id="csscan-property-font-size" class="csscan-property">font-size</td><td id="csscan-value-font-size" class="csscan-value"></td></tr><tr id="csscan-row-font-style"><td id="csscan-property-font-style" class="csscan-property">font-style</td><td id="csscan-value-font-style" class="csscan-value"></td></tr><tr id="csscan-row-font-variant"><td id="csscan-property-font-variant" class="csscan-property">font-variant</td><td id="csscan-value-font-variant" class="csscan-value"></td></tr><tr id="csscan-row-font-weight"><td id="csscan-property-font-weight" class="csscan-property">font-weight</td><td id="csscan-value-font-weight" class="csscan-value"></td></tr><tr id="csscan-row-letter-spacing"><td id="csscan-property-letter-spacing" class="csscan-property">letter-spacing</td><td id="csscan-value-letter-spacing" class="csscan-value"></td></tr><tr id="csscan-row-line-height"><td id="csscan-property-line-height" class="csscan-property">line-height</td><td id="csscan-value-line-height" class="csscan-value"></td></tr><tr id="csscan-row-text-decoration"><td id="csscan-property-text-decoration" class="csscan-property">text-decoration</td><td id="csscan-value-text-decoration" class="csscan-value"></td></tr><tr id="csscan-row-text-align"><td id="csscan-property-text-align" class="csscan-property">text-align</td><td id="csscan-value-text-align" class="csscan-value"></td></tr><tr id="csscan-row-text-indent"><td id="csscan-property-text-indent" class="csscan-property">text-indent</td><td id="csscan-value-text-indent" class="csscan-value"></td></tr><tr id="csscan-row-text-transform"><td 
id="csscan-property-text-transform" class="csscan-property">text-transform</td><td id="csscan-value-text-transform" class="csscan-value"></td></tr><tr id="csscan-row-white-space"><td id="csscan-property-white-space" class="csscan-property">white-space</td><td id="csscan-value-white-space" class="csscan-value"></td></tr><tr id="csscan-row-word-spacing"><td id="csscan-property-word-spacing" class="csscan-property">word-spacing</td><td id="csscan-value-word-spacing" class="csscan-value"></td></tr><tr id="csscan-row-color"><td id="csscan-property-color" class="csscan-property">color</td><td id="csscan-value-color" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-background" class="csscan-header">Background</th></tr><tr id="csscan-row-background-attachment"><td id="csscan-property-background-attachment" class="csscan-property">bg-attachment</td><td id="csscan-value-background-attachment" class="csscan-value"></td></tr><tr id="csscan-row-background-color"><td id="csscan-property-background-color" class="csscan-property">bg-color</td><td id="csscan-value-background-color" class="csscan-value"></td></tr><tr id="csscan-row-background-image"><td id="csscan-property-background-image" class="csscan-property">bg-image</td><td id="csscan-value-background-image" class="csscan-value"></td></tr><tr id="csscan-row-background-position"><td id="csscan-property-background-position" class="csscan-property">bg-position</td><td id="csscan-value-background-position" class="csscan-value"></td></tr><tr id="csscan-row-background-repeat"><td id="csscan-property-background-repeat" class="csscan-property">bg-repeat</td><td id="csscan-value-background-repeat" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-size" class="csscan-header">Box</th></tr><tr id="csscan-row-width"><td id="csscan-property-width" class="csscan-property">width</td><td id="csscan-value-width" class="csscan-value"></td></tr><tr id="csscan-row-height"><td id="csscan-property-height" 
class="csscan-property">height</td><td id="csscan-value-height" class="csscan-value"></td></tr><tr id="csscan-row-border-top"><td id="csscan-property-border-top" class="csscan-property">border-top</td><td id="csscan-value-border-top" class="csscan-value"></td></tr><tr id="csscan-row-border-right"><td id="csscan-property-border-right" class="csscan-property">border-right</td><td id="csscan-value-border-right" class="csscan-value"></td></tr><tr id="csscan-row-border-bottom"><td id="csscan-property-border-bottom" class="csscan-property">border-bottom</td><td id="csscan-value-border-bottom" class="csscan-value"></td></tr><tr id="csscan-row-border-left"><td id="csscan-property-border-left" class="csscan-property">border-left</td><td id="csscan-value-border-left" class="csscan-value"></td></tr><tr id="csscan-row-margin"><td id="csscan-property-margin" class="csscan-property">margin</td><td id="csscan-value-margin" class="csscan-value"></td></tr><tr id="csscan-row-padding"><td id="csscan-property-padding" class="csscan-property">padding</td><td id="csscan-value-padding" class="csscan-value"></td></tr><tr id="csscan-row-max-height"><td id="csscan-property-max-height" class="csscan-property">max-height</td><td id="csscan-value-max-height" class="csscan-value"></td></tr><tr id="csscan-row-min-height"><td id="csscan-property-min-height" class="csscan-property">min-height</td><td id="csscan-value-min-height" class="csscan-value"></td></tr><tr id="csscan-row-max-width"><td id="csscan-property-max-width" class="csscan-property">max-width</td><td id="csscan-value-max-width" class="csscan-value"></td></tr><tr id="csscan-row-min-width"><td id="csscan-property-min-width" class="csscan-property">min-width</td><td id="csscan-value-min-width" class="csscan-value"></td></tr><tr id="csscan-row-outline-color"><td id="csscan-property-outline-color" class="csscan-property">outline-color</td><td id="csscan-value-outline-color" class="csscan-value"></td></tr><tr 
id="csscan-row-outline-style"><td id="csscan-property-outline-style" class="csscan-property">outline-style</td><td id="csscan-value-outline-style" class="csscan-value"></td></tr><tr id="csscan-row-outline-width"><td id="csscan-property-outline-width" class="csscan-property">outline-width</td><td id="csscan-value-outline-width" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-position" class="csscan-header">Positioning</th></tr><tr id="csscan-row-position"><td id="csscan-property-position" class="csscan-property">position</td><td id="csscan-value-position" class="csscan-value"></td></tr><tr id="csscan-row-top"><td id="csscan-property-top" class="csscan-property">top</td><td id="csscan-value-top" class="csscan-value"></td></tr><tr id="csscan-row-bottom"><td id="csscan-property-bottom" class="csscan-property">bottom</td><td id="csscan-value-bottom" class="csscan-value"></td></tr><tr id="csscan-row-right"><td id="csscan-property-right" class="csscan-property">right</td><td id="csscan-value-right" class="csscan-value"></td></tr><tr id="csscan-row-left"><td id="csscan-property-left" class="csscan-property">left</td><td id="csscan-value-left" class="csscan-value"></td></tr><tr id="csscan-row-float"><td id="csscan-property-float" class="csscan-property">float</td><td id="csscan-value-float" class="csscan-value"></td></tr><tr id="csscan-row-display"><td id="csscan-property-display" class="csscan-property">display</td><td id="csscan-value-display" class="csscan-value"></td></tr><tr id="csscan-row-clear"><td id="csscan-property-clear" class="csscan-property">clear</td><td id="csscan-value-clear" class="csscan-value"></td></tr><tr id="csscan-row-z-index"><td id="csscan-property-z-index" class="csscan-property">z-index</td><td id="csscan-value-z-index" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-list" class="csscan-header">List</th></tr><tr id="csscan-row-list-style-image"><td id="csscan-property-list-style-image" 
class="csscan-property">list-style-image</td><td id="csscan-value-list-style-image" class="csscan-value"></td></tr><tr id="csscan-row-list-style-type"><td id="csscan-property-list-style-type" class="csscan-property">list-style-type</td><td id="csscan-value-list-style-type" class="csscan-value"></td></tr><tr id="csscan-row-list-style-position"><td id="csscan-property-list-style-position" class="csscan-property">list-style-position</td><td id="csscan-value-list-style-position" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-table" class="csscan-header">Table</th></tr><tr id="csscan-row-vertical-align"><td id="csscan-property-vertical-align" class="csscan-property">vertical-align</td><td id="csscan-value-vertical-align" class="csscan-value"></td></tr><tr id="csscan-row-border-collapse"><td id="csscan-property-border-collapse" class="csscan-property">border-collapse</td><td id="csscan-value-border-collapse" class="csscan-value"></td></tr><tr id="csscan-row-border-spacing"><td id="csscan-property-border-spacing" class="csscan-property">border-spacing</td><td id="csscan-value-border-spacing" class="csscan-value"></td></tr><tr id="csscan-row-caption-side"><td id="csscan-property-caption-side" class="csscan-property">caption-side</td><td id="csscan-value-caption-side" class="csscan-value"></td></tr><tr id="csscan-row-empty-cells"><td id="csscan-property-empty-cells" class="csscan-property">empty-cells</td><td id="csscan-value-empty-cells" class="csscan-value"></td></tr><tr id="csscan-row-table-layout"><td id="csscan-property-table-layout" class="csscan-property">table-layout</td><td id="csscan-value-table-layout" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-effects" class="csscan-header">Effects</th></tr><tr id="csscan-row-text-shadow"><td id="csscan-property-text-shadow" class="csscan-property">text-shadow</td><td id="csscan-value-text-shadow" class="csscan-value"></td></tr><tr id="csscan-row--webkit-box-shadow"><td 
id="csscan-property--webkit-box-shadow" class="csscan-property">-webkit-box-shadow</td><td id="csscan-value--webkit-box-shadow" class="csscan-value"></td></tr><tr id="csscan-row-border-radius"><td id="csscan-property-border-radius" class="csscan-property">border-radius</td><td id="csscan-value-border-radius" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-other" class="csscan-header">Other</th></tr><tr id="csscan-row-overflow"><td id="csscan-property-overflow" class="csscan-property">overflow</td><td id="csscan-value-overflow" class="csscan-value"></td></tr><tr id="csscan-row-cursor"><td id="csscan-property-cursor" class="csscan-property">cursor</td><td id="csscan-value-cursor" class="csscan-value"></td></tr><tr id="csscan-row-visibility"><td id="csscan-property-visibility" class="csscan-property">visibility</td><td id="csscan-value-visibility" class="csscan-value"></td></tr></tbody></table></div></body></html>

The XPath above is correct; I have checked it with FirePath. Can anyone tell me what I am doing wrong?


Solution

  • The answer to the question above is somewhat subtle. My original code looked something like this:

    $xpath=new DOMXPath($page);
    ..
    ...
    ...
    $page->loadHTML($content);
    ..
    ...
    $totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
    $total = $xpath->query($totalPath);
    ...
    ...
    

    What happens above is that $xpath is created before the HTML has been loaded into the DOM, so it is bound to an empty document, and every query it runs is evaluated against that empty document. I then swapped the order of the two statements:

    ...
    ...
    $page->loadHTML($content);
    $xpath=new DOMXPath($page);
    ...
    ...
    $totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
    $total = $xpath->query($totalPath);
    

    Now it works, because $xpath is created on a document that already contains the loaded HTML.
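The corrected ordering can be sketched as a small self-contained script. This is only an illustration under stated assumptions: the inline $html string is a made-up stand-in for the tidied contents of filename.htm (three tables, with the value 411 in the fourth cell of the third table's first row), and the tidy step is left out so the example does not depend on the tidy extension.

```php
<?php
// Hypothetical stand-in markup for the tidied page (not the real file).
$html = '<html><body>'
      . '<table><tbody><tr><td>header</td></tr></tbody></table>'
      . '<table><tbody><tr><td>marks</td></tr></tbody></table>'
      . '<table><tbody><tr>'
      . '<td>Practical Marks</td><td>328</td><td>Theory Marks</td><td>411</td>'
      . '</tr></tbody></table>'
      . '</body></html>';

$page = new DOMDocument();

// Collect parser warnings instead of printing them; real-world HTML is rarely clean.
libxml_use_internal_errors(true);
$page->loadHTML($html);          // 1. load the markup first
libxml_clear_errors();

$xpath = new DOMXPath($page);    // 2. only then bind the XPath object to the document

$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$total = $xpath->query($totalPath);

echo $total->length, "\n";                       // prints 1
echo trim($total->item(0)->textContent), "\n";   // prints 411
```

If the DOMXPath object has to be created early for some reason, creating a fresh one after every loadHTML() call avoids querying a stale document.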