Search code examples
xmlhtml-parsingxquerywebharvest

Web-Harvest determine data type by table column header


I am using Web-Harvest to do some web page "scraping". I have a table of values that I need to organize. The table is very simple. However, I need to use the table's column headings to determine the data types for each column. The table looks something like this....

<table>
  <tr class="header">
    <td>Name</td>
    <td>Age</td>
    <td>Gender</td>
  </tr>
  <tr>
    <td>Bill</td>
    <td>25</td>
    <td>M</td>
  </tr>
  <tr>
    <td>Emily</td>
    <td>31</td>
    <td>F</td>
  </tr>
</table>

I'm using XQuery, as shown in the Web-Harvest examples, and I know how to get the values by index and also using attributes (class, id, etc.) but in this case, I'm not sure how to determine that column 1 is name, column 2 is age, etc... I want to output XML in this form...

<person>
  <name>Bill</name>
  <age>25</age>
  <gender>M</gender>
</person>
<person>
  <name>Emily</name>
  <age>31</age>
  <gender>F</gender>
</person>

I saw this code fragment from another post but I am not exactly sure how it works.

//tr[td[.="Jim"]]/td[count(ancestor::table/thead/tr/th[.="Credit"]/preceding-sibling::*)+1]

It looks like for each row the ancestor (parent?) is referenced to find the corresponding column heading. Other than that, I'm lost. Any additional information would be very helpful.

Thanks in advance.


Solution

  • This XQuery expression:

      <persons>
       {
       let $names := /*/tr[1]/td
         return
             for $tr in /*/tr[position() ge 2]
               return
                 <person>
                   { for $i in 1 to count($tr/td)
                       return
                         element {$names[$i]} {$tr/td[$i]}
                   }
                </person>
         }
       </persons>     
    

    when applied on the provided XML document:

    <table>
      <tr class="header">
        <td>Name</td>
        <td>Age</td>
        <td>Gender</td>
      </tr>
      <tr>
        <td>Bill</td>
        <td>25</td>
        <td>M</td>
      </tr>
      <tr>
        <td>Emily</td>
        <td>31</td>
        <td>F</td>
      </tr>
    </table>
    

    produces the wanted, correct result:

    <persons>
       <person>
          <Name>
             <td>Bill</td>
          </Name>
          <Age>
             <td>25</td>
          </Age>
          <Gender>
             <td>M</td>
          </Gender>
       </person>
       <person>
          <Name>
             <td>Emily</td>
          </Name>
          <Age>
             <td>31</td>
          </Age>
          <Gender>
             <td>F</td>
          </Gender>
       </person>
    </persons>