
How to untar multiple files with extensions .tar.gz.aa, .tar.gz.ab, ... in Windows?


How do I untar multiple files with extensions .tar.gz.aa, .tar.gz.ab, ... through .tar.gz.an, each around 10 GB, in Windows?

I've tried the following commands in PowerShell (with admin rights):

cat <name>.tar.gz.aa | tar xzvf -

cat : Exception of type 'System.OutOfMemoryException' was thrown.
At line:1 char:1
+ cat <name>.tar.gz.aa | tar xzvf -
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Get-Content], OutOfMemoryException
    + FullyQualifiedErrorId : System.OutOfMemoryException,Microsoft.PowerShell.Commands.GetContentCommand

cat *.tar.gz.* | zcat | tar xvf -
zcat : The term 'zcat' is not recognized as the name of a cmdlet, function, script file, or operable program. Check
the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:18
+ cat *.tar.gz.* | zcat | tar xvf -
+                  ~~~~
    + CategoryInfo          : ObjectNotFound: (zcat:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

Thanks in advance! I would also be happy to hear of any solutions for Linux, in case anyone else is facing the same difficulty.


Solution

  • You are calling cat (an alias for Get-Content in Windows PowerShell) to read the contents of a single file and then trying to pipe the parsed content to tar. That is what causes the OutOfMemoryException. Get-Content is designed to read ASCII and Unicode text files, not binary files, and certainly not 10 GB of them. Even with enough memory available, I don't know how well Get-Content would perform on single files that large.

    Just pass the file path to tar like this, adding any other arguments you need (for example, one that controls the output directory):

    tar xvzf "$name.tar.gz.aa"
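
The point about Get-Content and binary data can be illustrated with a byte-level copy. This is a minimal sketch, assuming (the question does not confirm it) that the .aa through .an files are pieces of one archive produced by a tool such as split, in which case they would have to be joined in order before tar can read the result. Join-SplitFile is a hypothetical helper name. It uses .NET file streams, so the data is copied in chunks as raw bytes instead of being parsed as text.

```powershell
# Hypothetical helper: joins split parts into one file as raw bytes.
# Pass full paths, because .NET resolves relative paths against the
# process working directory, not the current PowerShell location.
function Join-SplitFile {
  param(
    [Parameter(Mandatory)] [string[]] $Path,        # part files, already in order
    [Parameter(Mandatory)] [string]   $Destination  # file to create (overwritten if it exists)
  )

  $out = [System.IO.File]::Create($Destination)
  try {
    foreach ($part in $Path) {
      $in = [System.IO.File]::OpenRead($part)
      try {
        # Stream.CopyTo copies through a fixed-size buffer,
        # so memory use stays small regardless of file size.
        $in.CopyTo($out)
      } finally {
        $in.Dispose()
      }
    }
  } finally {
    $out.Dispose()
  }
}
```

Possible usage, with hypothetical paths (the joined copy needs as much free disk space as all parts combined):

    $parts = Get-ChildItem -File -Path 'C:\data\*.tar.gz.a[a-n]' | Sort-Object Name
    Join-SplitFile -Path $parts.FullName -Destination 'C:\data\joined.tar.gz'
    tar -xzf 'C:\data\joined.tar.gz'

Under the same assumption, in a Linux shell such as bash, where cat is binary-safe, the first command from the question with all parts listed in order should work without an intermediate file.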
    

    You can extract all of the archives with a loop in one go (with some helpful output and result checking). This code is also 100% executable in PowerShell Core and should work on Linux:

    Push-Location $pathToFolderWithGzips
    
    try {
      ( Get-ChildItem -File *.tar.gz.a[a-n] ).FullName | ForEach-Object {
        Write-Host "Extracting $_"
        tar xzf $_
      
        if( $LASTEXITCODE -ne 0 ) {
          Write-Warning "tar returned $LASTEXITCODE"
        }
      }
    } finally {
      Pop-Location
    }
    

    Let's break this down:

    • $pathToFolderWithGzips should be set to the full path to the directory containing your tarballs.
    • Push-Location works like cd but uses the location stack. You can return to previous directories with Pop-Location. We change directories to the location we want to extract the files to.
      • Note: PowerShell Core supports the POSIX-like cd - and cd +
    • Wrap the rest in a try block so we can go back to the previous folder location after the try completes.
    • ( Get-ChildItem -File *.tar.gz.a[a-n] ).FullName enumerates all files in the current directory that match the wildcard pattern, where the last letter must be one of a through n. Accessing the FullName property gives us only the fully-qualified path of each file, which is all we need to pass down the pipeline.
    • | ForEach-Object { ... } receives the FullName values from the previous expression and runs the script block once for each fully-qualified path.
    • Write-Host writes text to the console (via the information stream in PowerShell 5 and later). The text does not go to the output stream, so it is not passed down the pipeline or captured when you assign the result to a variable. Write-Warning is used further on for a similar effect but is visually distinct.
      • Use Write-Output instead if you do want the text to be processed within the same session later on, but usually we want to operate on objects over strings if we can.
    • $_ is an alias for $PSItem, which is an automatic variable used for pipeline context. Every file path iterated over in the ForEach-Object loop will be referenced as $PSItem. We pass the archive path to tar with this variable.
    • $LASTEXITCODE is set when the last executable finishes running. This works similarly to $? in bash (not to be confused with PowerShell's own $?, which is a Boolean success flag). -ne is the "not equals" operator.
    • finally is used after closing the try block to Pop-Location back to the previous directory. The finally block is always executed, regardless of whether the try code succeeds or fails.
      • I'm admittedly not good with the tar executable, so if you know how to control the output folder without being in the current directory, you can omit the Push-Location, Pop-Location, try, and finally parts and just run what is inside the try block, modifying the tar command appropriately. In that case you will also need to prefix *.tar.gz.a[a-n] with $pathToFolderWithGzips (e.g. $pathToFolderWithGzips\*.tar.gz.a[a-n]).
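
As one way to control the output folder without changing directories, both GNU tar and the bsdtar that ships with recent Windows versions accept -C followed by a directory. The following is a minimal sketch under that assumption. Expand-TarArchive, $Pattern, and $OutputDir are hypothetical names, not part of the answer above.

```powershell
# Hypothetical helper: extracts every archive matching $Pattern into $OutputDir
# without changing the current location.
function Expand-TarArchive {
  param(
    [Parameter(Mandatory)] [string] $Pattern,    # e.g. "$pathToFolderWithGzips\*.tar.gz.a[a-n]"
    [Parameter(Mandatory)] [string] $OutputDir   # folder to extract into
  )

  # Create the output folder if it does not exist yet.
  New-Item -ItemType Directory -Path $OutputDir -Force | Out-Null

  Get-ChildItem -File -Path $Pattern | ForEach-Object {
    Write-Host "Extracting $($_.FullName)"

    # -C makes tar switch to $OutputDir before writing the extracted files.
    tar -xzf $_.FullName -C $OutputDir

    if ($LASTEXITCODE -ne 0) {
      Write-Warning "tar returned $LASTEXITCODE for $($_.Name)"
    }
  }
}
```

Possible usage, with a hypothetical output path:

    Expand-TarArchive -Pattern "$pathToFolderWithGzips\*.tar.gz.a[a-n]" -OutputDir 'C:\data\extracted'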