Search code examples
xmlgitbashdiffunzip

Unzip - how do I put inflated files into an array


I have an .xls file which I want to pretty print in order to have nice diffing rather than just binary files being changed.

My approach is to unzip this entire thing. The resulting string does not contain linebreaks so I ran it through xmllint --format. But on this seemingly simple path I have encountered several issues which I have already spent hours on:

  1. unzip multiple files inside the xml archive. This results in invalid xml. Even with unzip -q options I get multiple DTDs and so on. xmllint breaks on this without formatting the input.

    unzip -c -a -q myFile.xlsx | xmllint --format -

  2. I tried splitting the XML into an array using read in order to feed each individual xml file to xmllint. In the result of read most array items seem to be empty and the third and fourth item contain 20something letters of the xml string.

    IFS='\<\?xml' read -r -a files <<< "$decompressed"

  3. I also tried just inserting linebreaks with sed but the filesize is so large that processing takes too long for making it feasible for diffing.

    ${decompressed/\>\</\>\n\</g}

I have just run out of ideas so I decided to consult you guys! Thanks ahead :)


Solution

  • I would extract the xslx into multiple files, pretty-print them and then do a recursive diff on the xml files. (there are also binary files that can't be diffed)

    Like this:

    # Unzip the xlsx files into folders
    unzip -aqd foo foo.xlsx
    unzip -aqd bar bar.xlsx
    
    # Pretty print all .xml and .rels files
    find foo bar \( -name '*.xml' -o -name '*.rels' \) -exec xmllint --format {} --output {} \;
    
    # Now you can recursively diff them
    diff -urN foo bar
    

    You can create bash function that wraps the process.

    Note: The unpacked xlsx file may contain binary files with .bin ending. You need to exclude them from diff:

    diff -urNx '*.bin' foo bar