Tags: xml, bash, xml-parsing, sitemap, xmllint

xmllint: problems getting one result per line


I know that my question actually contains two questions...

First, I want to use xmllint to output the content of the "loc" tags. The sitemap I load declares a default namespace, xmlns="...".

In the xmllint shell, I need to do this:

setrootns
xpath //defaultns:loc

That works... no problem. But I need to do this in a bash script.

AFAIK, xmllint has no command-line option that means "go ahead, setrootns", so neither of these works:

xmllint --xpath "//loc" sitemaps.xml
# or
xmllint --xpath "//defaultns:loc" sitemaps.xml

This is the first question: how can I tell xmllint to load the default namespace?

If I can't, let's take a look at my second solution:

I can remove the xmlns attribute, and then there is no namespace to deal with:

xmllint --xpath "//loc" <(sed -r 's/xmlns="[^"]*"//' sitemaps.xml)

But now the whole output, all 500 "loc" elements, is concatenated on one line!

I tried this too:

xmllint --shell sitemaps.xml <<EOF
setrootns
xpath //defaultns:loc/text()
EOF

Or this:

xmllint --shell sitemaps.xml <<EOF
setrootns
cat //defaultns:loc
EOF

The first gives me (for example):

465  TEXT
    content=http://... 

with the URL truncated.

The second gives me a "------" separator every second line, and a "/>" on the last line.

And I'm starting to get very nervous... :)

Many thanks if you can find a solution.

The goal is to have every location, one per line.


Solution

  • I used to do something similar to get rid of the namespace:

    clean_xml_message=$(echo "$xml_message" | sed 's/xmlns/ignore/')
    

    Then you could try putting the newlines back:

    sed 's/></>\n</g' 
    

    I guess you only want the URLs, without the surrounding <loc></loc> tags? Then I would first select all the loc elements with xmllint, which gives:

    <loc>...</loc><loc>...</loc><loc>...</loc>
    

    Then add newlines around the tags with sed 's/<loc>/<loc>\n/g' | sed 's#</loc>#\n</loc>#g', which gives:

    <loc>
    ...
    </loc><loc>
    ...
    </loc><loc>
    ...
    </loc>
    

    Finally, remove the tag lines with grep -v "<loc>" | grep -v "</loc>"; a single grep -v "^<" would also do it (-v inverts the selection: http://unixhelp.ed.ac.uk/CGI/man-cgi?grep).
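
The steps above could be collected into one small helper, roughly like this. It is only a sketch: the function name split_locs is made up for illustration, and it assumes GNU sed (which accepts \n in the replacement text) and input made only of <loc> elements, as in the example output shown above.

```shell
#!/usr/bin/env bash

# split_locs (hypothetical helper name): reads concatenated
# <loc>...</loc> elements on stdin and prints the text of each
# element on its own line.
# Assumption: GNU sed, which understands \n in the replacement.
split_locs() {
  # 1. Put a newline after every opening tag. [^>]* also tolerates
  #    attributes on the tag, in case any are present.
  # 2. Put a newline before every closing tag.
  # 3. Drop every line that starts with "<", i.e. the tag-only lines.
  sed -e 's#<loc[^>]*>#<loc>\n#g' -e 's#</loc>#\n</loc>#g' | grep -v '^<'
}
```

With the command from the question, this might be used as xmllint --xpath "//loc" <(sed 's/xmlns="[^"]*"//' sitemaps.xml) | split_locs. A namespace-agnostic XPath such as //*[local-name()='loc'] may make stripping the xmlns attribute unnecessary, but that variant is untested here.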