xml bash xpath xmllint

Converting two fields of a table in an XML file into CSV using xmllint in bash?


I've got an XML file (converted from HTML) containing fields like this:

<tr>
  <td data-title="Date">2018-01-01</td>
  <td data-title="Version"><a href="https://some-link">25.1</a></td>
</tr>
<tr>
  <td data-title="Date">2018-03-01</td>
  <td data-title="Version"><a href="https://some-link">24.1</a></td>
</tr>

I've been using 'xmllint' to extract single values:

textarea=$(echo "$xml" | xmllint --xpath 'string(//*[@id="content"])' - 2>/dev/null )

and multiple values:

list=$(echo "$xml" | xmllint --xpath 'string(/html/body/div/ul)' - 2>/dev/null )

but now I want to extract two fields from each record, in CSV format or something similar.

The closest I've got is this:

xpath tr/*[@data-title="Date" or @data-title="Version"]/text()
Object is a Node Set :
Set contains 20 nodes:
1  TEXT
    content=Apr 9, 2018 6:13 PM UTC
2  TEXT
    content=Mar 21, 2018 10:41 PM UTC
3  TEXT
    content=Mar 19, 2018 9:22 PM UTC

Can you show me a way to achieve this with a better xpath?


Solution

  • Here is one way to do it with xmllint:

    xmllint --html --xpath '//tr/td[@data-title="Date"] | //tr/td[@data-title="Version"]' test.html | sed -re 's%(</td>)%\1\n%g'
    

    Output:

    <td data-title="Date">2018-01-01</td>
    <td data-title="Version"><a href="https://some-link">25.1</a></td>
    <td data-title="Date">2018-03-01</td>
    <td data-title="Version"><a href="https://some-link">24.1</a></td>
    
    • Add the --html option to tell xmllint that the input is HTML.
    • Start the XPath with // so that it matches tr elements anywhere in the document. Your XPath has no leading slash, so it is evaluated relative to the current node; in the xmllint shell, the current node depends on where you moved to with the cd command.
    • Finally, use the | operator to combine two or more XPath expressions into a single node set.
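
The command above still prints td elements rather than CSV. As a rough sketch of the remaining step, the helper below (tds_to_csv is a made-up name, not part of xmllint) strips the tags and joins every two lines with a comma. It assumes each row yields exactly one Date cell followed by one Version cell, and that the values contain no commas or quotes; the sample input is the output shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn lines of <td> elements on stdin into CSV rows.
# Assumes cells arrive in Date, Version order, one pair per table row.
tds_to_csv() {
  sed -e 's/<[^>]*>//g' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//' \
      -e '/^$/d' |
    paste -d, - -   # join each pair of consecutive lines with a comma
}

# Demo on the sample output from the answer.
tds_to_csv <<'EOF'
<td data-title="Date">2018-01-01</td>
<td data-title="Version"><a href="https://some-link">25.1</a></td>
<td data-title="Date">2018-03-01</td>
<td data-title="Version"><a href="https://some-link">24.1</a></td>
EOF
# prints:
# 2018-01-01,25.1
# 2018-03-01,24.1
```

In practice the xmllint and sed pipeline from the answer would be piped straight into tds_to_csv. Blank lines and lines that held only a closing tag are dropped before pairing, so it should not matter whether your xmllint version already puts each node on its own line. If a row can lack one of the two cells, the pairing goes out of step and a per-row query would be safer.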