Add carriage returns in PowerShell to manually pretty-print a large XML file


I have a really big (280 MB) XML file that's all on one line. I have a few editors that can barely handle opening it, but none of them will let me pretty-print it.

I'm trying to format it in PowerShell, but haven't been able to figure out the syntax. To make the file more readable, I'd like to replace every closing tag with a carriage return + newline followed by the closing tag, but I haven't been able to get it to work.

Here's what I've tried so far:

(get-content .\ReallyHugeXMLFile2.xml) -replace ('</','`n</') | out-file .\ReallyHugeXMLFile2Formatted.xml
(get-content .\ReallyHugeXMLFile2.xml) -replace ('</','\r\n</') | out-file .\ReallyHugeXMLFile2Formatted2.xml
(get-content .\ReallyHugeXMLFile2.xml) -replace ('</','\\r\\n</') | out-file .\ReallyHugeXMLFile2Formatted3.xml

Thanks


Solution

  • TheIncorrigible1 has provided the crucial pointer in a comment:

    Assuming that your large XML file can still be loaded into a System.Xml.XmlDocument instance as a whole, you can simply invoke its .Save() method in order to create a pretty-printed output file (which obviates the need for manual newline insertion; plus, use of an XML parser is always preferable to text manipulation).

    # Load the file into a [xml] (System.Xml.XmlDocument) instance...
    ($xmlDoc = New-Object xml).Load($PWD.ProviderPath + '/HugeFile.xml')
    # ... and save it, which automatically pretty-prints it.
    $xmlDoc.Save($PWD.ProviderPath + '/HugeFilePrettyPrinted.xml')
    

    Note the need to prepend $PWD.ProviderPath to the filenames to ensure that .NET uses PowerShell's current directory (.NET's usually differs, and .NET doesn't know about PowerShell drives created with New-PSDrive).[1]

    Note: The resulting file will have LF-only newlines, not CRLF ones.
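
    If a specific newline format or indentation is required, one option is to save through an XmlWriter configured with XmlWriterSettings rather than relying on the defaults of .Save(<path>). The following is a minimal sketch, not part of the original answer; the sample file names and the tiny stand-in document are assumptions made only so that the snippet is self-contained.

```powershell
# Sketch: pretty-print with explicitly chosen newlines and indentation.
# Assumption: a small stand-in file in the temp directory replaces the real input.
$inFile  = [IO.Path]::Combine([IO.Path]::GetTempPath(), 'sample-in.xml')
$outFile = [IO.Path]::Combine([IO.Path]::GetTempPath(), 'sample-out.xml')
[IO.File]::WriteAllText($inFile, '<?xml version="1.0"?><catalog><book><title>A</title></book></catalog>')

# Load the document as a whole, as in the command above.
$xmlDoc = New-Object xml
$xmlDoc.Load($inFile)

# Configure the output format explicitly.
$settings = New-Object System.Xml.XmlWriterSettings
$settings.Indent       = $true       # pretty-print
$settings.IndentChars  = '  '        # two spaces per nesting level
$settings.NewLineChars = "`r`n"      # CRLF; use "`n" for LF-only newlines

# Save via the configured writer and make sure the file gets closed.
$writer = [System.Xml.XmlWriter]::Create($outFile, $settings)
try {
  $xmlDoc.Save($writer)
} finally {
  $writer.Close()
}
```

    Note that the double-quoted "`r`n" matters here: PowerShell only expands backtick escape sequences inside double-quoted strings.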


    A feasibility demonstration:

    First, run the following code (PSv5+) to create a sample XML file that is about 280 MB in size. Note that you can easily tweak the code to specify a different target size.

    Note:

    • File HugeFile.xml will be created in the current directory, and running the pretty-printing command later creates an (even larger) HugeFilePrettyPrinted.xml in the same location.

    • Creating this file can take minutes.

    # Create a sample single-line XML file of a given size (approximately).
    # Note: Depending on the target size, this can take a long time to complete.
    #       Additionally, for performance reasons the code is written so that
    #       the file content must fit into memory as a whole.
    
    # The desired size of the resulting file.
    $targetFileSize = 280mb
    $targetFile = './HugeFile.xml'
    
    # The XML element to repeat.
    $repeatingElementTemplate = '<book><title>De Profundis {0:000000000000}</title></book>'
    # Determine how often it must be repeated to reach the target size (approximately).
    # Subtracting 4 accounts for the 16-character placeholder expanding to 12 digits.
    $repeatCount = $targetFileSize / ($repeatingElementTemplate.Length - 4)
    
    Write-Verbose -vb "Creating XML file '$targetFile' of approximate size $('{0:N2}' -f ($targetFileSize / 1mb)) MB..."
    # Create the file.
    '<?xml version="1.0"?><catalog>' | Set-Content -NoNewline -Encoding Utf8 $targetFile
    -join (1..$repeatCount).ForEach({ $repeatingElementTemplate -f $_ }) |
      Add-Content -NoNewline -Encoding Utf8 $targetFile
    '</catalog>' | Add-Content -NoNewline -Encoding Utf8 $targetFile
    

    Then, run the pretty-printing command above.

    On my single-core Windows 10 VM with 3 GB of RAM (running on older hardware), this took about 40 seconds. Eric himself reports less than 5 seconds on his machine.
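
    The approach above assumes that the document fits into memory as a whole. If it does not, a possible alternative (not part of the original answer) is to stream the file from an XmlReader to an indenting XmlWriter, which avoids building a full XmlDocument. The sketch below uses a small stand-in file in the temp directory; the file names are hypothetical.

```powershell
# Sketch: streaming pretty-printing without loading a full XmlDocument.
# Assumption: a small stand-in file in the temp directory replaces the real input.
$inFile  = [IO.Path]::Combine([IO.Path]::GetTempPath(), 'stream-in.xml')
$outFile = [IO.Path]::Combine([IO.Path]::GetTempPath(), 'stream-out.xml')
[IO.File]::WriteAllText($inFile, '<?xml version="1.0"?><catalog><book><title>A</title></book><book><title>B</title></book></catalog>')

# Skip insignificant whitespace on input so that the writer controls the layout.
$readerSettings = New-Object System.Xml.XmlReaderSettings
$readerSettings.IgnoreWhitespace = $true

# Ask the writer to indent its output.
$writerSettings = New-Object System.Xml.XmlWriterSettings
$writerSettings.Indent = $true

$reader = [System.Xml.XmlReader]::Create($inFile, $readerSettings)
$writer = [System.Xml.XmlWriter]::Create($outFile, $writerSettings)
try {
  # With the reader in its initial state, WriteNode() copies the whole document.
  $writer.WriteNode($reader, $false)
} finally {
  $writer.Close()
  $reader.Close()
}
```

    Whether this is faster than .Load() / .Save() has not been measured here; the main expected benefit is lower peak memory use.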


    [1] Ensuring that a relative PowerShell filesystem path is passed correctly to a .NET method:

    • As stated, .NET's notion of the current directory typically differs from that of PowerShell, so relative PowerShell paths cannot be used as-is.

    • Forming a full path with $PWD.ProviderPath ($PWD.ProviderPath + '/<fileInCurrentDir>') ensures that PowerShell's current filesystem location is expressed as a native filesystem path (thanks, TheIncorrigible1). .NET methods only understand native paths; they don't know about custom PowerShell drives created with New-PSDrive, and they don't understand PowerShell's provider-prefixed notation, which $PWD stringifies to when the current location is a UNC path (e.g.,
      Microsoft.PowerShell.Core\FileSystem::\\some-server\some-share\some-folder).

    • If you don't use custom PowerShell drives, and you're not running your code directly from UNC locations, you can more simply construct a full path based on the current location with
      "$PWD/<fileInCurrentDir>".

    • Conversely, for full robustness you'll have to use
      (Get-Location -PSProvider FileSystem).ProviderPath + '/<fileInCurrentDir>', given that PowerShell's current location may be one from a provider other than the filesystem provider; e.g., HKCU:\Console (registry provider).
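
    The points above can be sketched as follows. This is only an illustration: the file name is hypothetical, and the variable names are made up for this example.

```powershell
# Sketch: build a full, native filesystem path for use with .NET methods.
# Hypothetical file name, for illustration only.
$fileName = 'HugeFile.xml'

# .NET's idea of the current directory; this often differs from PowerShell's.
$dotNetDir = [Environment]::CurrentDirectory

# PowerShell's current *filesystem* location as a native path, even if the
# session's current location belongs to another provider (e.g., HKCU:).
$psDir = (Get-Location -PSProvider FileSystem).ProviderPath

# Combine directory and file name into a full path that .NET understands.
$fullPath = [IO.Path]::Combine($psDir, $fileName)

Write-Verbose -vb ".NET current directory:   $dotNetDir"
Write-Verbose -vb "PowerShell filesystem dir: $psDir"
Write-Verbose -vb "Full path to pass to .NET: $fullPath"
```

    Using [IO.Path]::Combine() instead of string concatenation avoids having to decide on a separator character manually.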