bash awk sed grep xmllint

Extract field from XML file


XML file:

<head>
  <head2>
    <dict type="abc" file="/path/to/file1"></dict>
    <dict type="xyz" file="/path/to/file2"></dict>
  </head2>
</head>

I need to extract the list of files from this. So the output would be

/path/to/file1
/path/to/file2

So far, I've managed the following.

grep "<dict*file=" /path/to/xml.file | awk '{print $3}' | awk -F= '{print $NF}'

Solution

  • Quick and dirty, based on your sample rather than on everything XML allows:

    # sed, a bit safer: only look at lines between <head> and </head>
    sed -e '/<head>/,/<\/head>/!d' -e '/.*[[:blank:]]file="\([^"]*\)".*/!d' -e 's//\1/' YourFile
    
    # sed, brute force: print the file="..." value from any line that has one
    sed -n 's/.*[[:blank:]]file="\([^"]*\)".*/\1/p' YourFile

    # awk, quick and unsafe, relying on the layout of your sample
    awk -F 'file="|">' '/<head>/{h=1} /<\/head>/{h=0} h && /[[:blank:]]file=/ { print $2 }' YourFile
    

    That said, I don't recommend this kind of extraction on XML unless you really know the format and content of your source and no more appropriate tool is available. Extra fields, escaped quotes, and string content that looks like a tag are common causes of failures and unexpected results.
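To make the "extra field" failure concrete, here is a minimal sketch. The sample below is hypothetical: it is the question's data with an invented `lang` attribute added to the second element, so `file=` is no longer the third whitespace-separated field on that line.

```shell
# Hypothetical sample: the question's XML, plus an extra attribute (lang)
# on the second <dict> element.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<head>
  <head2>
    <dict type="abc" file="/path/to/file1"></dict>
    <dict type="xyz" lang="en" file="/path/to/file2"></dict>
  </head2>
</head>
EOF

# Position-based: take the quoted value of the third field, whatever it is.
by_position=$(awk '/<dict.*file=/ { split($3, part, "\""); print part[2] }' "$sample")

# Name-based: anchor on the attribute name file= instead of a field position.
by_name=$(sed -n 's/.*[[:blank:]]file="\([^"]*\)".*/\1/p' "$sample")

printf 'by position:\n%s\n' "$by_position"
printf 'by name:\n%s\n' "$by_name"
rm -f "$sample"
```

On this input the position-based command returns `en` for the second element, while the name-based one still returns `/path/to/file2`. Neither handles escaped quotes or attributes split across lines, so this only illustrates one failure mode.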

    Now, to adapt your own script:

    #grep "<dict*file=" /path/to/xml.file | awk '{print $3}' | awk -F= '{print $NF}'
    awk '! /<dict.*file=/ {next} {$0=$3;FS="\"";$0=$0;print $2;FS=OFS}' YourFile
    
    • There is no need for grep in front of awk; use a leading pattern filter such as /<dict.*file=/ instead.
    • The second awk, which only exists to use a different separator (FS), can be folded into the same script by changing FS. Because a new FS only takes effect at the next evaluation (the next line by default), you can force a re-evaluation of the current content with $0=$0.
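The effect of `$0=$0` can be seen in isolation with a small sketch. The input string here is illustrative and not taken from the question; `$1` is read before `FS` is changed so that the record has already been split with the original separator.

```shell
# Illustrative input: two space-separated fields, each containing a colon.
line='a:b c:d'

# FS changed, no re-evaluation: the current record keeps its original split.
stale=$(printf '%s\n' "$line" | awk '{ first = $1; FS = ":"; print first " -> " $1 }')

# FS changed, then $0=$0: the current record is split again with the new FS.
fresh=$(printf '%s\n' "$line" | awk '{ first = $1; FS = ":"; $0 = $0; print first " -> " $1 }')

printf '%s\n%s\n' "$stale" "$fresh"
```

The first command prints `a:b -> a:b` and the second prints `a:b -> a`, which is why the one-liner above assigns `$0=$0` right after setting `FS`.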