I realize this post contains two questions.
First, I want to use xmllint to output the content of the "loc" tags. The sitemap I load declares a default namespace (xmlns="...").
In the xmllint shell, I have to do this:
setrootns
xpath //defaultns:loc
That works without a problem, but I need to do it in a bash script.
As far as I know, xmllint has no command-line option equivalent to setrootns, so neither of these works:
xmllint --xpath "//loc" sitemaps.xml
# or
xmllint --xpath "//defaultns:loc" sitemaps.xml
That is the first question: how can I tell xmllint to register the default namespace?
If that is not possible, here is my second approach:
I can remove the xmlns attribute, so that there is no namespace to deal with:
xmllint --xpath "//loc" <(sed -r 's/xmlns="[^"]*"//' sitemaps.xml)
But now the content of all 500 "loc" elements is concatenated on a single line!
I tried this too:
xmllint --shell sitemaps.xml <<EOF
setrootns
xpath //defaultns:loc/text()
EOF
Or this:
xmllint --shell sitemaps.xml <<EOF
setrootns
cat //defaultns:loc
EOF
The first gives me, for example:
465 TEXT
content=http://...
with the URL truncated.
The second prints "------" every other line, and a "/>" on the last line.
And I'm starting to get quite nervous... :)
Many thanks to anyone who finds a solution.
The goal is to have every location, one per line.
I used to do something similar:
clean_xml_message=$(echo "$xml_message" | sed 's/xmlns/ignore/')
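As a minimal sketch of that idea, the renaming can be wrapped in a small filter. The function name strip_default_ns and the example.com sample document are made up for illustration. It only touches the first "xmlns=" on each line, so prefixed declarations such as xmlns:xsi are left alone:

```shell
# Hypothetical helper: rename the default namespace declaration so the
# parser sees an ordinary attribute and plain names like //loc can match.
strip_default_ns() {
  # Only the first "xmlns=" per line is rewritten; "xmlns:prefix=" is untouched.
  sed 's/xmlns=/ignore=/'
}

# Made-up sample document, just to show the effect of the filter.
printf '%s\n' '<urlset xmlns="http://example.com/ns"><url><loc>http://example.com/a</loc></url></urlset>' \
  | strip_default_ns
```

The result could then be piped to xmllint, for example strip_default_ns < sitemaps.xml | xmllint --xpath "//loc" -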
You could then try to put the newlines back:
sed 's/></>\n</g'
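A rough sketch of that step as a filter follows. The name break_between_tags is made up. The "\n" in the replacement is a GNU sed feature, so this version uses a backslash followed by a literal newline, which I believe also works with other sed implementations:

```shell
# Hypothetical helper: insert a line break wherever one tag directly
# follows another ("><").
break_between_tags() {
  sed 's/></>\
</g'
}

# Made-up sample: two loc elements glued together on one line.
printf '%s\n' '<loc>http://example.com/a</loc><loc>http://example.com/b</loc>' \
  | break_between_tags
```

Each loc element should then sit on its own line, still wrapped in its tags.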
I guess you only want the URLs, without the <loc></loc> tags?
In that case I would select all the loc elements with xmllint, which gives:
<loc>...</loc><loc>...</loc><loc>...</loc>
Then add the newlines with sed 's/<loc>/<loc>\n/g' | sed 's#</loc>#\n</loc>#g', which gives:
<loc>
...
</loc><loc>
...
</loc><loc>
...
</loc>
Finally, remove the tag lines with grep -v "<loc>" | grep -v "</loc>"
or a single grep -v "^<"
could do it (-v inverts the selection: http://unixhelp.ed.ac.uk/CGI/man-cgi?grep).
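Putting the steps together, here is a sketch of the whole post-processing as one filter. The function name locs_one_per_line and the sample URLs are made up, and it assumes the input is the concatenated <loc>...</loc> output described above and that no URL starts with "<":

```shell
# Hypothetical helper: read concatenated <loc>...</loc> elements on stdin
# and print the text of each element on its own line.
locs_one_per_line() {
  # 1. Put every opening and closing tag on a line of its own
  #    (backslash + literal newline is the portable way to insert a newline).
  # 2. Drop every line that starts with "<", which leaves only the URLs.
  sed -e 's#<loc>#<loc>\
#g' -e 's#</loc>#\
</loc>#g' | grep -v '^<'
}

# Made-up sample input standing in for the xmllint output.
printf '%s\n' '<loc>http://example.com/a</loc><loc>http://example.com/b</loc>' \
  | locs_one_per_line
```

With the namespace removed as in the question, this could be used as xmllint --xpath "//loc" ... | locs_one_per_line. Note that XML entities such as &amp; in the URLs are left as they are.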