powershell, file-format

Using PowerShell 7 to test files for Unix or Windows format


I have a vendor who sends 5-10 files a month. Recently they started sending us a mixture of Unix-format and Windows-format files. This vendor is notoriously difficult to work with, so it will be quicker to solve this problem myself.

I have a short PowerShell 7 script that works for the Unix files but wreaks havoc on the Windows files:

$Files = Get-ChildItem '\\XD1\Vendor_Incoming\*.csv'

foreach($file in $Files){
    (Get-Content -Raw -Path $file) -replace "`n","`r`n" |  Set-Content -NoNewline -Path $file
}

After searching SO and Google, I have yet to find an efficient way to test each file's format within the foreach statement, something like:

foreach($f in $Files){ If(thisfileisUnix){Process-File} }

Thank you for your time.


Solution

  • Your immediate problem, just to spell it out, is that you're blindly replacing "`n" (LF) instances with "`r`n" (CRLF) sequences, which means that files that already have CRLF sequences are accidentally corrupted, because you're effectively turning their CRLF sequences into CRCRLF sequences ("`r`r`n").
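
    To make the corruption concrete, here is a minimal sketch that applies the same blind replacement to a made-up sample string (not taken from the vendor files) and dumps the resulting bytes in hex:

```powershell
# Hypothetical sample: two lines that already use Windows-format CRLF newlines.
$sample = "a,b`r`nc,d`r`n"

# The same blind LF -> CRLF replacement as in the question.
$broken = $sample -replace "`n", "`r`n"

# Dump the bytes in hex: every newline is now 0D 0D 0A (CR CR LF).
$hex = ([System.Text.Encoding]::ASCII.GetBytes($broken) |
  ForEach-Object { $_.ToString('X2') }) -join ' '
$hex   # 61 2C 62 0D 0D 0A 63 2C 64 0D 0D 0A
```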


    Note:

    • Both solutions below rely on reading each file into memory in full.
    • If that is undesirable, a more complex solution is needed, as demonstrated in Santiago's answer.
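
    For illustration only, and not Santiago's code: one way to avoid loading a whole file just to classify it is to scan its bytes and stop at the first LF. The sketch below assumes an ASCII-compatible encoding such as UTF-8 (it would misjudge UTF-16 files) and assumes the first newline is representative of the whole file. The function name Test-UnixNewline is hypothetical.

```powershell
# Hypothetical helper: returns $true if the first LF in the file is NOT preceded by CR.
# Assumes an ASCII-compatible encoding (e.g. UTF-8); pass a full path, since .NET
# resolves relative paths against the process directory, not the PowerShell location.
function Test-UnixNewline {
  param([Parameter(Mandatory)][string] $LiteralPath)

  $stream = [System.IO.File]::OpenRead($LiteralPath)
  try {
    $buffer = [byte[]]::new(64KB)
    $previous = 0   # last byte of the previous chunk, for an LF at a chunk boundary
    while (($count = $stream.Read($buffer, 0, $buffer.Length)) -gt 0) {
      for ($i = 0; $i -lt $count; $i++) {
        if ($buffer[$i] -eq 10) {   # 10 = LF
          $before = if ($i -gt 0) { $buffer[$i - 1] } else { $previous }
          return $before -ne 13     # 13 = CR
        }
      }
      $previous = $buffer[$count - 1]
    }
    return $false   # no newline at all, so there is nothing to convert
  }
  finally {
    $stream.Dispose()
  }
}

# Possible usage, in the shape the question asks for:
#   foreach ($f in $Files) { if (Test-UnixNewline -LiteralPath $f.FullName) { <# convert #> } }
```

    Because the scan stops at the first newline, typically only the first line of each CSV file is examined.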

    A pragmatic solution that avoids this problem is to simply test for the presence of at least one "`r`n" (CRLF sequence) in the file's content and, if not found, assume that the file uses Unix-format newlines, "`n" (LF) only, and that the content therefore needs transforming:

    Get-ChildItem '\\XD1\Vendor_Incoming\*.csv' | 
      ForEach-Object {
        $text = $_ | Get-Content -Raw
        $isUnixFormat = -not $text.Contains("`r`n")
        if ($isUnixFormat) {
          $text.Replace("`n", "`r`n") |
            Set-Content -NoNewLine -LiteralPath $_.FullName
        }
      }
    

    Note that Set-Content uses its default character encoding (BOM-less UTF-8 in PowerShell 7), as it knows nothing about the original file's encoding, so you may have to pass an -Encoding argument.
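
    As a rough sketch of passing -Encoding, the following assumes the vendor files are UTF-8 and differ only in whether they start with a byte-order mark; any other encoding would need different handling. Both function names (Test-Utf8Bom, Convert-LfToCrlf) are hypothetical.

```powershell
# Hypothetical helper: $true if the file starts with the UTF-8 BOM (EF BB BF).
function Test-Utf8Bom {
  param([Parameter(Mandatory)][string] $LiteralPath)

  $stream = [System.IO.File]::OpenRead($LiteralPath)
  try {
    $buffer = [byte[]]::new(3)
    $read = $stream.Read($buffer, 0, 3)
    return $read -eq 3 -and $buffer[0] -eq 0xEF -and $buffer[1] -eq 0xBB -and $buffer[2] -eq 0xBF
  }
  finally {
    $stream.Dispose()
  }
}

# Hypothetical wrapper around the CRLF-presence test above.
# ASSUMPTION: the file is UTF-8, with or without a BOM.
function Convert-LfToCrlf {
  param([Parameter(Mandatory)][string] $LiteralPath)

  # Write the file back with the same BOM choice it arrived with.
  $encoding = if (Test-Utf8Bom -LiteralPath $LiteralPath) { 'utf8BOM' } else { 'utf8NoBOM' }
  $text = Get-Content -Raw -LiteralPath $LiteralPath -Encoding $encoding

  # $text is $null for an empty file; skip it along with files that already use CRLF.
  if ($text -and -not $text.Contains("`r`n")) {
    Set-Content -NoNewline -LiteralPath $LiteralPath -Encoding $encoding -Value $text.Replace("`n", "`r`n")
  }
}
```

    The 'utf8BOM' and 'utf8NoBOM' encoding names are accepted by PowerShell 7; Windows PowerShell 5.1 does not recognize them.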


    Here's a more robust, regex-based solution, which, however, is only necessary if there's a chance that any given file may contain a mix of LF and CRLF newlines:

    Get-ChildItem '\\XD1\Vendor_Incoming\*.csv' |
      ForEach-Object {
        $original = $_ | Get-Content -Raw
        $modified = $original -replace '(?<!\r)\n', "`r`n"
        if (-not [object]::ReferenceEquals($original, $modified)) {
          Set-Content -NoNewLine -LiteralPath $_.FullName -Value $modified
        }
      }
    
    • The above uses a regex with a negative lookbehind assertion ((?<!...)) to match only \n ("`n", LF) characters not preceded by \r ("`r", CR) and replace them with "`r`n" (Windows-format CRLF newlines).

    • It then tests whether any actual replacements were made, taking advantage of the fact that -replace, the regular-expression-based string replacement operator, returns the input string as-is if no actual replacement was made.

    • Only if an actual replacement was made is the modified content written back to the input file.
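
    The behavior described in these points can be tried out on in-memory strings, without touching any files. The sample strings below are made up for illustration:

```powershell
# Hypothetical sample mixing both newline styles: LF after "a", CRLF after "b".
$mixed = "a`nb`r`nc"
$fixed = $mixed -replace '(?<!\r)\n', "`r`n"

# Dump the bytes in hex: only the lone LF gained a CR; the existing CRLF is untouched.
$fixedHex = ([System.Text.Encoding]::ASCII.GetBytes($fixed) |
  ForEach-Object { $_.ToString('X2') }) -join ' '
$fixedHex   # 61 0D 0A 62 0D 0A 63

# A string that already uses CRLF throughout gives the regex nothing to match,
# so -replace hands back the very same string object.
$clean = "a`r`nb`r`nc"
$unchanged = $clean -replace '(?<!\r)\n', "`r`n"
$sameObject = [object]::ReferenceEquals($clean, $unchanged)
$sameObject   # True, so no write-back would happen
```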