Tags: xml, bash, validation, xmlstarlet, xmllint

Validate multiple concatenated XML in one file


Several XML documents were concatenated into a single file; a demo example is shown below. How can such a file be validated with either the xmlstarlet or the xmllint command?

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>
<BookHeaderMsg xmlns:xsi="THE URL" xsi:noNamespaceSchemaLocation="NAME.xsd">
  <BookHdr>
     <tag>value</tag>
     <tag2>value</tag2>
  </BookHdr>
  <Payload>
     <payloadTag>value</payloadTag>
     <payloadTag2>value</payloadTag2>
  </Payload>
</BookHeaderMsg>

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>
<BookTransfer xmlns:xsi="THE URL" xsi:noNamespaceSchemaLocation="NAME.xsd">
  <BookHdr>
     <tag>value</tag>
     <tag2>value</tag2>
  </BookHdr>
  <Payload>
     <payloadTag>value</payloadTag>
     <payloadTag2>value</payloadTag2>
  </Payload>
</BookTransfer>

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>
<BookTransfer xmlns:xsi="THE URL" xsi:noNamespaceSchemaLocation="NAME.xsd">
  <BookHdr>
     <tag>value 1</tag>
     <tag2>value 2</tag2>
  </BookHdr>
  <Payload>
     <payloadTag>value 1</payloadTag>
     <payloadTag2>value 2</payloadTag2>
  </Payload>
</BookTransfer>

I tried xmlstarlet val Filename and also xmllint --valid Filename; both reported the file as invalid. However, if I split each XML document into a separate file, each one is valid (unfortunately, splitting is not feasible).


Solution

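  • A file that contains several XML declarations and root elements is not a well-formed XML document as a whole, so each embedded document has to be checked on its own. I managed to validate files made up of several concatenated XML documents with the following steps: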

    1. Loop over the combined files
    2. Use the csplit command to split each combined file into its individual XML documents
    3. Validate the split XML documents from step 2 with the xmlstarlet command and append its output to a log file
    4. Remove the split XML documents from step 2 with the rm command
    5. Repeat the steps above for the remaining files

    The script:

    #!/bin/bash
    
    SOURCE_DIR="./src"
    LOG_DIR="./log"
    
    
    files=()
    while IFS='' read -r -d ''
    do
            files+=("$REPLY")
    done < <(find "$SOURCE_DIR" -maxdepth 1 -type f -iname "*.xml" -printf '%p\0' | sort -zn)
    
    total="${#files[@]}"
    mkdir -p "$LOG_DIR"   # make sure the log directory exists before writing to it
    echo "start validating $total files" > "$LOG_DIR/summary.log"
    counter=0
    for file in "${files[@]}"
    do
            ((counter++))
            # extract: split the combined file at every XML declaration
            csplit "$file" --prefix="$file" --suffix-format='_%03d.xml.txt' --keep-files --elide-empty-files '/<?xml/' '{*}' &>/dev/null
            echo "$counter of $total working on $file"
            echo "$counter of $total working on $file" >> "$LOG_DIR/summary.log"
            # validate (without options, xmlstarlet val checks well-formedness only)
            xmlstarlet val "$SOURCE_DIR"/*.xml.txt >> "$LOG_DIR/summary.log"
    
            # clean up the split pieces before the next file is processed
            rm "${SOURCE_DIR}"/*.xml.txt
    done
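
As a variation on the script above, the split pieces can be written to a temporary directory instead of the source directory, so nothing is left behind there if the run is interrupted. The following is only a minimal sketch under a few assumptions: split_xml_docs and validate_concatenated are hypothetical helper names that do not appear in the original script, every XML declaration is assumed to start on its own line, awk is swapped in for csplit, and xmllint --noout is used to check well-formedness only.

```shell
#!/bin/bash

# split_xml_docs FILE OUTDIR   (hypothetical helper, not from the original script)
# Writes each embedded document to OUTDIR/doc_NNN.xml, starting a new piece
# at every line that contains an XML declaration.
split_xml_docs() {
    local file=$1 outdir=$2
    awk -v dir="$outdir" '
        /<\?xml/ { if (out) close(out); out = sprintf("%s/doc_%03d.xml", dir, ++n) }
        out      { print > out }
    ' "$file"
}

# validate_concatenated FILE   (hypothetical helper)
# Splits FILE into a temporary directory, checks every piece with xmllint,
# prints one result line per piece, and returns non-zero if any piece fails.
validate_concatenated() {
    local file=$1 tmpdir piece status=0
    tmpdir=$(mktemp -d) || return 1
    split_xml_docs "$file" "$tmpdir"
    for piece in "$tmpdir"/*.xml; do
        [ -e "$piece" ] || continue          # no pieces were produced
        if xmllint --noout "$piece" 2>/dev/null; then
            echo "$file: ${piece##*/} - well-formed"
        else
            echo "$file: ${piece##*/} - NOT well-formed"
            status=1
        fi
    done
    rm -rf -- "$tmpdir"                      # remove the temporary pieces
    return "$status"
}
```

Calling validate_concatenated ./src/combined.xml (the path is only an example) would print one line per embedded document. To validate against a schema rather than only checking well-formedness, xmllint accepts --schema followed by the path to an XSD file.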