I've got an XML file (converted from HTML) containing fields like this:
<tr>
<td data-title="Date">2018-01-01</td>
<td data-title="Version"><a href="https://some-link">25.1</a></td>
</tr>
<tr>
<td data-title="Date">2018-03-01</td>
<td data-title="Version"><a href="https://some-link">24.1</a></td>
</tr>
I've been using 'xmllint' to extract single values:
textarea=$(echo "$xml" | xmllint --xpath 'string(//*[@id="content"])' 2>/dev/null )
and multiple values:
list=$(echo "$xml" | xmllint --xpath 'string(/html/body/div/ul)' 2>/dev/null )
but now I want to extract two fields from each record, in CSV format or something similar.
The closest I've got is this:
xpath tr/*[@data-title="Date" or @data-title="Version"]/text()
Object is a Node Set :
Set contains 20 nodes:
1 TEXT
content=Apr 9, 2018 6:13 PM UTC
2 TEXT
content=Mar 21, 2018 10:41 PM UTC
3 TEXT
content=Mar 19, 2018 9:22 PM UTC
Can you show me a way to achieve this with a better xpath?
Here is one way to do it with xmllint:
xmllint --html --xpath '//tr/td[@data-title="Date"] | //tr/td[@data-title="Version"]' test.html | sed -re 's%(</td>)%\1\n%g'
Output:
<td data-title="Date">2018-01-01</td>
<td data-title="Version"><a href="https://some-link">25.1</a></td>
<td data-title="Date">2018-03-01</td>
<td data-title="Version"><a href="https://some-link">24.1</a></td>
Notes on the command:
--html: tells xmllint to parse the input as HTML.
//: a leading // makes the XPath match at any depth, starting from the document root. Your XPath has no leading slash, so it is evaluated relative to the current node; in the xmllint shell, that depends on where you moved with the cd command.
|: the union operator, which combines the results of two or more XPaths.
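Since the question asks for CSV, the one-cell-per-line output shown above can be post-processed with standard tools. This is only a minimal sketch, not part of the original answer. It assumes every row has exactly one Date cell followed by one Version cell, and that the values contain no commas or newlines. The variable names rows and csv are made up for illustration, and the sample data is copied from the output above.

```shell
# Sample input: the xmllint output shown above, one <td> element per line.
rows='<td data-title="Date">2018-01-01</td>
<td data-title="Version"><a href="https://some-link">25.1</a></td>
<td data-title="Date">2018-03-01</td>
<td data-title="Version"><a href="https://some-link">24.1</a></td>'

# Strip every tag, then join each pair of lines (Date, Version) with a comma.
csv=$(printf '%s\n' "$rows" | sed -e 's/<[^>]*>//g' | paste -d, - -)

printf '%s\n' "$csv"
# prints:
# 2018-01-01,25.1
# 2018-03-01,24.1
```

In a real pipeline the same sed and paste stages would be appended to the xmllint command instead of reading from a variable. If a row can be missing one of the two cells, the pairing breaks, and looping over the rows one at a time would be safer.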