Indent XML-like file with awk and xmllint

I have an "XML-like" file that contains a lot of configuration data. I say "XML-like" because it is really three XML documents concatenated together, each terminated with "]]>]]>".

E.g.

<?xml version="1.0" encoding="UTF-8"?>
<hello><world>"Earth"</world></hello>]]>]]><?xml version="1.0" encoding="UTF-8"?>
<data><lemur><type>"Ring-tailed"</type></lemur></data>]]>]]><?xml version="1.0" encoding="UTF-8"?>
<data><lemur><type>"Mouse"</type></lemur></data>]]>]]>

I am trying to write a script that will call xmllint to indent all of the XML tags in the file. However, xmllint (and many other XML formatting programs) seems to require that there be only one XML document in the file. That is, the file needs to start with "<?xml version="1.0" encoding="UTF-8"?>" and contain only one root element.

So I tried writing an awk script that would split the data into separate chunks and pass each one to xmllint, but I am getting an error that I can't get past. I've put the script and the output below.

$ awk '
BEGIN {
    RS = "]]>]]>"
    xmlFormatCommand = "xmllint --format -"
} 

{
    print $0 | xmlFormatCommand 
}
' SmallTest.xml

-:3: parser error : XML declaration allowed only at the start of the document
<?xml version="1.0" encoding="UTF-8"?>
     ^
-:4: parser error : Extra content at the end of the document
<data><lemur><type>"Ring-tailed"</type></lemur></data>
^

If I do it in two separate operations, one where awk prints to three temporary files, and one where xmllint operates on those files, then it works.

E.g.

awk 'BEGIN {RS = "]]>]]>"} {print $0 > "Section_" NR ".txt" }' SmallTest.xml

That results in three files Section_1.txt, Section_2.txt, and Section_3.txt. The contents of Section_2.txt are:

$ cat Section_2.txt
<?xml version="1.0" encoding="UTF-8"?>
<data><lemur><type>"Ring-tailed"</type></lemur></data>

I can format that file with xmllint:

$ cat Section_2.txt | xmllint --format -
<?xml version="1.0" encoding="UTF-8"?>
<data>
  <lemur>
    <type>"Ring-tailed"</type>
  </lemur>
</data>

So I don't understand why I can't just pipe it to xmllint in the first place in the awk script.

I appreciate any help you can provide.

-Jon

Solution

Your problem, in a nutshell, is that awk keeps using the same pipe. awk identifies an open pipe by the exact command string with which it was opened (which means that you cannot run the exact same command twice at the same time), and every record printed to that string is written into the same pipe, one after the other. So you have only one xmllint process, and it gets the whole file as input.

You can fix this by closing the pipe after every record:

$ awk '
BEGIN {
    RS = "]]>]]>"
    xmlFormatCommand = "xmllint --format -"
} 

{
    print $0 | xmlFormatCommand 
    close(xmlFormatCommand)      # <-- HERE
}
' SmallTest.xml

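Here close takes as its argument the identifier under which the pipe is remembered (the command string). It closes the pipe and waits for the command to finish, so the next print to the same string starts a fresh xmllint process. I am aware that this looks strange compared to other programming languages.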
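
To see the difference in isolation, here is a minimal sketch in which grep -c . (which just counts non-empty lines) stands in for xmllint. That substitution, the three-line input, and the variable names shared and separate are my own choices for illustration, not part of your script.

```shell
# Without close(): all three records go into ONE "grep -c ." process,
# which therefore counts three lines.
shared=$(printf 'a\nb\nc\n' | awk '{ print | "grep -c ." }')
echo "$shared"      # prints 3

# With close(): each record gets its own "grep -c ." process,
# and each of those processes sees exactly one line.
separate=$(printf 'a\nb\nc\n' | awk '{ cmd = "grep -c ."; print | cmd; close(cmd) }')
echo "$separate"    # prints 1 on three separate lines
```

The first command behaves like your original script (one process, whole input); the second behaves like the fixed one.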

By the way, the file in your question ends with a separator, so whatever follows it (typically just a newline) becomes one more record that is not an XML document. You may want to put a condition in there that excludes such whitespace-only records. For example,

$ awk '
BEGIN {
    RS = "]]>]]>"
    xmlFormatCommand = "xmllint --format -"
} 

! /^\s*$/ {  # <-- HERE
    print $0 | xmlFormatCommand 
    close(xmlFormatCommand)
}
' SmallTest.xml

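where /^\s*$/ matches records that have only whitespace between the beginning and the end, and ! inverts that match. Note that \s is a GNU awk extension; if your awk does not support it, [[:space:]] is the POSIX spelling of the same character class.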
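
If you want to convince yourself that the trailing record exists and that the condition removes it, here is a small sketch. It uses ";" as a stand-in for "]]>]]>" (my simplification, so that it works in any awk) and the POSIX class [[:space:]] in place of \s; the variable names all and kept are hypothetical.

```shell
# Input: two real records, each followed by the separator, then a final newline.
# Unfiltered, awk sees three records: "a", "b", and a lone newline.
all=$(printf 'a;b;\n' | awk 'BEGIN { RS = ";" } { n++ } END { print n }')

# With the whitespace-only condition, the lone newline is skipped.
kept=$(printf 'a;b;\n' | awk 'BEGIN { RS = ";" } !/^[[:space:]]*$/ { n++ } END { print n }')

echo "$all $kept"   # prints: 3 2
```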