Using PowerShell to merge PDFs in multiple subfolders with PDFtk and then delete the original PDF files


I have a root folder that contains many subfolders, each with multiple PDFs. I have a PowerShell script that walks the folder structure and creates a merged PDF file (using PDFtk) for each subfolder, as follows:

    $pdftk = "C:\Program Files (x86)\PDFtk\bin\pdftk.exe"
    $RootFolder = "path to root folder"
    Get-ChildItem -r -include *.pdf | group DirectoryName | % {& $PDFtk $_.group CAT OUTPUT "$($_.Name | Split-Path -Parent)\$($_.Name | Split-Path -Leaf)_merged.pdf"}

The script works as required; however, I will be working with a very large amount of data, so I need to delete the original PDFs from each folder once its merge has completed.

Basically, I need the script to look in the first folder, 4830_2017, create the merged file 4830_2017_merged.pdf, and then delete the PDFs located inside the 4830_2017 folder before moving on to the next folder and doing the same thing there.

I am struggling to find the correct way of deleting the contents of each folder after the merge.

Thanks in advance for your help.


Solution

  • In your ForEach-Object script block, $_.Group contains the System.IO.FileInfo instances that represent the *.pdf files of the group at hand, i.e. of one directory, so you can pipe them to Remove-Item after a successful merge:

    (Get-ChildItem -Recurse -Filter *.pdf) | 
      Group-Object DirectoryName | 
        ForEach-Object {
          & $PDFtk $_.Group.FullName CAT OUTPUT "$($_.Name | Split-Path -Parent)\$($_.Name | Split-Path -Leaf)_merged.pdf"
          if (0 -eq $LASTEXITCODE) { # If the merge succeeded.
            $_.Group | Remove-Item   # Delete.
          }
        }
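
To make the output-path expression easier to follow, here is a minimal sketch that evaluates it on its own. The directory below is a hypothetical stand-in for $_.Name (a path under the temp folder, chosen purely for illustration), and Join-Path is used in place of the literal backslash so that the sketch also runs on non-Windows platforms:

```powershell
# Hypothetical subfolder path, standing in for $_.Name (a group's directory path).
$dir = Join-Path ([IO.Path]::GetTempPath()) '4830_2017'

# Same logic as in the pipeline: parent directory + "<folder name>_merged.pdf".
$parent  = Split-Path -Parent $dir
$leaf    = Split-Path -Leaf $dir
$outFile = Join-Path $parent "${leaf}_merged.pdf"

$outFile  # the merged file is placed next to the subfolder, not inside it
```

Because the merged file is written to the parent directory, it is not among the files in $_.Group, so piping $_.Group to Remove-Item leaves it untouched.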
    

    Note:

    • The Get-ChildItem command is enclosed in (...) so as to ensure that its output is collected in full before further processing, to rule out side effects from new *.pdf files getting created or old ones getting deleted affecting the recursive enumeration.

      • -Filter *.pdf is used in lieu of -Include *.pdf, which is functionally equivalent in this case, but performs much better, because the filtering is delegated to the file-system APIs, at the source, rather than being applied by PowerShell after the fact - see this answer.
    • & $PDFtk $_.Group was changed to & $PDFtk $_.Group.FullName to ensure that full file paths are passed; note that this is no longer necessary in PowerShell (Core) 7+, where System.IO.FileInfo and System.IO.DirectoryInfo instances consistently stringify to their full paths - see this answer.

    • Group-Object outputs Microsoft.PowerShell.Commands.GroupInfo instances.
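
To see what the grouping step produces without involving PDFtk, the following sketch builds a throwaway folder tree and inspects the groups. The folder and file names (groupdemo_..., doc1.pdf, and so on) are hypothetical, the files are empty placeholders rather than real PDFs, and -WhatIf is added so that the deletion is only previewed:

```powershell
# Build a throwaway folder tree with empty placeholder files (hypothetical names).
$root = Join-Path ([IO.Path]::GetTempPath()) "groupdemo_$([guid]::NewGuid().ToString('N'))"
foreach ($sub in '4830_2017', '4831_2017') {
  $dir = New-Item -ItemType Directory -Path (Join-Path $root $sub) -Force
  foreach ($n in 1..2) {
    $null = New-Item -ItemType File -Path (Join-Path $dir.FullName "doc$n.pdf")
  }
}

# Collect all files up front, then group them by directory, as in the solution.
$groups = (Get-ChildItem -LiteralPath $root -Recurse -Filter *.pdf) |
  Group-Object DirectoryName

foreach ($g in $groups) {
  # .Name is the grouping value (the directory path); .Group holds the FileInfo objects.
  '{0}: {1} file(s)' -f (Split-Path -Leaf $g.Name), $g.Group.Count
  $g.Group | Remove-Item -WhatIf   # -WhatIf only previews the deletion
}

Remove-Item -LiteralPath $root -Recurse -Force  # clean up the demo tree
```

Once the preview lists the expected files, dropping -WhatIf performs the actual deletion, as in the solution above.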