Is it possible to extract / read part of a file within a zip using powershell?


I have a powershell 4.0 script that does various things to organise some large zip files across an internal network. This is all working fine, but I am looking to make a few improvements. One thing that I want to do is extract some details that are within an XML file within the ZIP files.

I tested this on some small ZIP files by extracting just the XML which worked fine. I target the specific file because the zip can contain thousands of files that can be pretty large. This worked fine on my test files, but when I expanded the testing, I realised this wasn't particularly optimal because the XML files I am reading can get pretty large themselves (one was ~5GB but they could potentially be larger). So adding a file extraction step to the chain creates an unacceptable delay to the process, and I need to find an alternative.

Ideally, I would be able to read the 3-5 values from the XML file from within the ZIP without extracting it. The values are always relatively early on in the file, so perhaps it's possible to extract just the first ~100 KB of the file, treat that extract as a text file, and find the required values?

Is this possible / more performant than just extracting the entire file?

If I can't speed things up I'll have to look at another way. I do have limited control over the file content, so could potentially look at splitting out those details into a smaller separate file at ZIP creation. This would be a last resort though.


Solution

  • You should be able to do this with the System.IO.Compression.ZipFile class:

    # import the containing assembly
    Add-Type -AssemblyName System.IO.Compression.FileSystem
    
    try{
      # open the zip file with ZipFile
      $zipFileItem = Get-Item .\Path\To\File.zip
      $zipFile = [System.IO.Compression.ZipFile]::OpenRead($zipFileItem.FullName)
    
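      # find the desired file entry (first match, in case the same name exists in several folders)
      $compressedFileEntry = $zipFile.Entries |Where-Object Name -eq 'MyAwesomeButHugeFile.xml' |Select-Object -First 1
      if(-not $compressedFileEntry){ throw "Entry not found in $($zipFileItem.FullName)" }
    
      # read up to the first 100KB of the file stream;
      # a single Read() call may return fewer bytes than requested, so loop until the buffer is full or the stream ends
      $buffer = New-Object byte[] 100KB   # [byte[]]::new() would need PowerShell 5.0+
      $stream = $compressedFileEntry.Open()
      $readLength = 0
      while($readLength -lt $buffer.Length){
        $bytesRead = $stream.Read($buffer, $readLength, $buffer.Length - $readLength)
        if($bytesRead -le 0){ break }
        $readLength += $bytesRead
      }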
    }
    finally{
      # clean up
      if($stream){ $stream.Dispose() }
      if($zipFile){ $zipFile.Dispose() }
    }
    
    if($readLength){
      $xmlString = [System.Text.Encoding]::UTF8.GetString($buffer, 0, $readLength)
      # do what you must with `$xmlString` here :)
      # (it is a truncated fragment, so search it as text; an [xml] cast would fail on it)
    }
    else{
      Write-Warning "Failed to extract partial xml string"
    }
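
If the values are plain text elements near the top of the document, another option is to skip the fixed-size buffer and hand the entry stream to `System.Xml.XmlReader`, which reads forward only and can stop as soon as everything has been found. The following is a minimal sketch under a few assumptions, not a drop-in solution: `Get-ZipXmlValue` is a hypothetical helper name, the wanted elements are assumed to contain only text (no child elements), and the XML is assumed to have no DOCTYPE, which the default reader settings reject.

```powershell
# import the containing assembly
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: returns a hashtable mapping each wanted element name
# to the text of its first occurrence inside the given zip entry.
function Get-ZipXmlValue {
  param(
    [string]$ZipPath,
    [string]$EntryName,
    [string[]]$ElementName
  )

  $found = @{}
  $zipFile = $stream = $reader = $null
  try{
    # resolve against the PowerShell location, not the .NET working directory
    $fullPath = (Resolve-Path -LiteralPath $ZipPath).ProviderPath
    $zipFile = [System.IO.Compression.ZipFile]::OpenRead($fullPath)

    $entry = $zipFile.Entries |Where-Object Name -eq $EntryName |Select-Object -First 1
    if(-not $entry){ throw "Entry '$EntryName' not found in $fullPath" }

    $stream = $entry.Open()
    $reader = [System.Xml.XmlReader]::Create($stream)

    # walk forward node by node and stop once every wanted element has been seen
    $more = $reader.Read()
    while($more -and $found.Count -lt $ElementName.Count){
      $isWanted = $reader.NodeType -eq [System.Xml.XmlNodeType]::Element -and
                  $ElementName -contains $reader.LocalName -and
                  -not $found.ContainsKey($reader.LocalName)
      if($isWanted){
        $name = $reader.LocalName
        # assumption: text-only element; this call leaves the reader on the node after the end tag
        $found[$name] = $reader.ReadElementContentAsString()
        $more = -not $reader.EOF
      }
      else{
        $more = $reader.Read()
      }
    }
  }
  finally{
    # clean up
    if($reader){ $reader.Close() }
    if($stream){ $stream.Dispose() }
    if($zipFile){ $zipFile.Dispose() }
  }

  $found
}
```

It could be called as `Get-ZipXmlValue -ZipPath .\Path\To\File.zip -EntryName MyAwesomeButHugeFile.xml -ElementName 'Title','Version'`, where `Title` and `Version` are placeholder element names. Because the reader decodes the stream itself, this avoids guessing the encoding and avoids cutting a multi-byte character in half at the 100KB boundary. The trade-off is that if one of the names never appears, the reader keeps going until the end of the entry.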