
XPath Script does not work for input one, but works perfectly fine for input two


I'm not sure how best to describe my problem. I have an XML file, and I'm trying to extract some information from it by searching for a node that contains a given string.

The script below does the job if I feed it only the part of the file with the <table> tag I'm interested in, rather than the whole XML file. I call that reduced file input two, and it works fine.

But it does not work if I use the whole XML file (input one). Here is the script:

    (tidy -asxml input.xml | xmllint --xpath 'descendant-or-self::*[starts-with(text(), "Aktiv tid:")]/following-sibling::*/text()' -) 2>/dev/null

And here is the input one XML file (complete XML file):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <title>SpeedTouch - Bredbandsanslutning</title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <script type="text/javascript">var g_navitem = -1;</script>
  <script type="text/javascript"> var g_focus = -1;</script>
  <script type='text/javascript' src='/util.js'></script>
  <link rel="stylesheet" type="text/css" href="/styles.css">
</head>
<body onLoad="setFocus();" height="100%" style="margin:0px">
  <noscript>
    <h1>Thomson - SpeedTouch</h1>
    <h4>To view the Web interface of your device, JavaScript must be supported and enabled on your browser! <br><br>Aktivera skriptstöd och uppdatera webbläsaren.</h4>
  </noscript>
  <table cellspacing="0" cellpadding="0" border="0" width="100%" style="background-color:white" height="100%">
    <tr>
      <td colspan="2">
        <table width="100%" cellspacing="0" cellpadding="0" border="0">
          <tr>
            <td style="padding-left:15px;" class="Product">THOMSON&nbsp;ST780</td><td align="right" style="padding:5px 15px 0px 0px;"><a href="http://www.thomson-broadband.com"><img src="/images/Thomson.gif" border="0" width="109" height="50" alt="THOMSON logo"></a></td>
          </tr>      
          <tr>
            <td colspan="2">
              <table width="100%" cellspacing="0" cellpadding="0" border="0">
                <tr style="background-image:url(/images/bar.gif)">
                  <td width="20%"></td>
                  <td width="10" align="left"></td>
                  <td width="10"><img src="/images/barend_left.gif"></td>
                  <td><img width="100%" height="10" src="/images/spacer_white.gif"></td>
                </tr>
                <tr style="background-image:url(/images/bar.gif)">
                  <td align="right"><img width="100%" height="10" src="/images/spacer_white.gif"></td>
                  <td width="10"><img src="/images/barend_right.gif"></td>
                  <td colspan="2"></td>
                </tr>
              </table>
            </td>
          </tr>
          <tr>
            <td></td><td align="right" valign="middle" style="padding-right:15px"><form name="langSelect" action="/cgi/language.cgi" method=post><span class="langSelect"><input type="hidden" name=6 value="en">
<a href="" onClick="setLanguage('en');submitForm(document.langSelect,0);return false;" title="English">en</a>&nbsp;
<strong>sv</strong>&nbsp;</span></form></td>
          </tr>
        </table>
      </td>
    </tr>
    <tr>
      <td colspan="2"><img src="/images/spacer.gif" border="0" width="1" height="10" alt=""><br></td>
    </tr>
    <tr>
      <td valign="top" style="padding-top:15px;padding-left:15px;">
        <script type="text/javascript">writeMenu();</script>
      </td>
      <td valign="top" style="background:url(/images/wave.gif) no-repeat top center;height:340px">
        <table cellpadding="0" cellspacing="0" border="0" style="margin-top:15px">
          <script type="text/javascript">writeNavBar();</script>
          <tr>
            <td>
              <table width="700" cellspacing="0" cellpadding="0" border="0">
                <tr>
                  <td>
                   <script type="text/javascript">pm_write_messages();</script>


<div class='contentcontainer'>
<hr>
<div class='contentitem'>
<table cellspacing='0' cellpadding='0'>
<tr><td class='icon' valign='top' width='100px'><img src='/images/cplngrxl.gif' alt='Fysisk anslutning OK'></td>
<td class='data' valign='top'><table cellspacing='0' cellpadding='0'><tr><td align='left'><span class='itemtitle'>DSL-anslutning</span></td><td align='right'></td></tr>
<tr><td colspan='2'><br><table cellspacing='0' cellpadding='0' width='100%'><tr><td width='40' valign='top'><img src='/images/bull__md.gif' alt=''></td><td valign='top'>
<span class='blocktitle'><a href="javascript:GoAndRemember('/cgi/b/dsl/ov/', '')">Visa mer...</a></span><br>
<table width='100%' class='datatable' cellspacing='0' cellpadding='0'>
<tr><td></td><td width='30px'></td><td width='220px'></td><td width='50px'></td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>Aktiv tid:</td><td colspan='3'>1 dag, 21:44:06</td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>Bandbredd (upp/ned) [kbps/kbps]:</td><td colspan='3'>1.058 / 21.373</td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>Överförda data (skickade/mottagna) [GB/GB]:</td><td colspan='3'>1,97 / 45,23</td></tr>
</table>
</td></tr></table>
</td></tr></table></td></tr></table></div>
<hr>
<div class='contentitem'>
<table cellspacing='0' cellpadding='0'>
<tr><td class='icon' valign='top' width='100px'><img src='/images/cintgrxl.gif' alt='Internetanslutning OK'></td>
<td class='data' valign='top'><table cellspacing='0' cellpadding='0'><tr><td align='left'><span class='itemtitle'>Internet</span></td><td align='right'></td></tr>
<tr><td colspan='2'><br><table cellspacing='0' cellpadding='0' width='100%'><tr><td width='40' valign='top'><img src='/images/bull__md.gif' alt=''></td><td valign='top'>
<span class='blocktitle'><a href="javascript:GoAndRemember('/cgi/b/is/_ethoa_/ov/', 'name=Internet')">Visa mer...</a></span><br>
<table width='100%' class='datatable' cellspacing='0' cellpadding='0'>
<tr><td></td><td width='30px'></td><td width='220px'></td><td width='50px'></td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>Typ:</td><td colspan='3'>ETHoA</td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>Aktiv tid:</td><td colspan='3'>1 dag, 21:44:04</td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>IP-adress:</td><td colspan='3'>x.x.x.x</td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>Överförda data (skickade/mottagna) [GB/GB]:</td><td colspan='3'>1,56 / 39,92</td></tr>
</table>
</td></tr></table>
</td></tr></table></td></tr></table></div>
</form>
<script type='text/javascript'>generateTasks()</script>
</div>
                  </td>
                </tr>
              </table>
            </td>
          </tr>
        </table>
      </td>
    </tr>
  </table>
</body>
</html>

And here is the input two XML file:

<hr>
<div class='contentitem'>
<table cellspacing='0' cellpadding='0'>
<tr><td class='icon' valign='top' width='100px'><img src='/images/cplngrxl.gif' alt='Fysisk anslutning OK'></td>
<td class='data' valign='top'><table cellspacing='0' cellpadding='0'><tr><td align='left'><span class='itemtitle'>DSL-anslutning</span></td><td align='right'></td></tr>
<tr><td colspan='2'><br><table cellspacing='0' cellpadding='0' width='100%'><tr><td width='40' valign='top'><img src='/images/bull__md.gif' alt=''></td><td valign='top'>
<span class='blocktitle'><a href="javascript:GoAndRemember('/cgi/b/dsl/ov/', '')">Visa mer...</a></span><br>
<table width='100%' class='datatable' cellspacing='0' cellpadding='0'>
<tr><td></td><td width='30px'></td><td width='220px'></td><td width='50px'></td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>Aktiv tid:</td><td colspan='3'>1 dag, 21:44:06</td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>Bandbredd (upp/ned) [kbps/kbps]:</td><td colspan='3'>1.058 / 21.373</td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>Överförda data (skickade/mottagna) [GB/GB]:</td><td colspan='3'>1,97 / 45,23</td></tr>
</table>
</td></tr></table>
</td></tr></table></td></tr></table></div>
<hr>
<div class='contentitem'>
<table cellspacing='0' cellpadding='0'>
<tr><td class='icon' valign='top' width='100px'><img src='/images/cintgrxl.gif' alt='Internetanslutning OK'></td>
<td class='data' valign='top'><table cellspacing='0' cellpadding='0'><tr><td align='left'><span class='itemtitle'>Internet</span></td><td align='right'></td></tr>
<tr><td colspan='2'><br><table cellspacing='0' cellpadding='0' width='100%'><tr><td width='40' valign='top'><img src='/images/bull__md.gif' alt=''></td><td valign='top'>
<span class='blocktitle'><a href="javascript:GoAndRemember('/cgi/b/is/_ethoa_/ov/', 'name=Internet')">Visa mer...</a></span><br>
<table width='100%' class='datatable' cellspacing='0' cellpadding='0'>
<tr><td></td><td width='30px'></td><td width='220px'></td><td width='50px'></td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>Typ:</td><td colspan='3'>ETHoA</td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>Aktiv tid:</td><td colspan='3'>1 dag, 21:44:04</td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>IP-adress:</td><td colspan='3'>x.x.x.x</td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' width='1' height='7' border='0' alt=''></td></tr>
<tr><td width='170'>Överförda data (skickade/mottagna) [GB/GB]:</td><td colspan='3'>1,56 / 39,92</td></tr>
</table>
</td></tr></table>
</td></tr></table></td></tr></table></div>
</form>
<script type='text/javascript'>generateTasks()</script>
</div>
                  </td>
                </tr>
              </table>
            </td>
          </tr>
        </table>
      </td>
    </tr>
  </table>
</body>
</html>

So for some reason if I remove everything before and including this line:

<div class='contentcontainer'>

the script works fine.

This seems very strange to me, though I suspect the cause is something basic.

So my question is how can I fix this?

Thanks in advance!


Solution

  • If you run only the first part of your pipeline to inspect its output, you will notice that

    tidy -asxml input.xml
    

    returns no data for the given file.

    This is due to the spurious </form>: the document contains only a single <form> but two </form>s. HTML Tidy reports this explicitly in the messages it writes to stderr, but your script discards them with 2>/dev/null.


    In short: While Tidy can clean up documents with warnings, any document with errors needs to be repaired before it can be processed.
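
A quick way to confirm the tag mismatch described above, before involving Tidy at all, is to count the opening and closing form tags. This is only a minimal sketch and not part of the original answer: the temporary sample file is a hypothetical, cut-down stand-in for input.xml, and the grep patterns assume the tags are written the way they appear in the page above.

```shell
#!/usr/bin/env bash
# Hypothetical, cut-down stand-in for input.xml: one <form>, two </form>.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html><body>
<form name="langSelect" action="/cgi/language.cgi" method=post><input type="hidden" name=6 value="en">
<strong>sv</strong></form>
<div class='contentcontainer'>
</form>
</div>
</body></html>
EOF

# grep -o prints every match on its own line, so wc -l counts matches
# rather than matching lines; tr strips the padding some wc versions add.
opens=$(grep -o -i '<form[ >]' "$sample" | wc -l | tr -d ' ')
closes=$(grep -o -i '</form>' "$sample" | wc -l | tr -d ' ')

echo "opening <form> tags: $opens"
echo "closing </form> tags: $closes"

if [ "$opens" -ne "$closes" ]; then
  echo "unbalanced <form> tags" >&2
fi

rm -f "$sample"
```

For the sample this reports one opening and two closing tags. To check the real file, point the two grep commands at input.xml instead. Running tidy -asxml input.xml on its own, without 2>/dev/null, shows Tidy's own diagnostics; its man page documents an exit status of 0 for success, 1 for warnings and 2 for errors.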