Search code examples
powershellutf-8

PowerShell : Set-Content Replace word and Encoding UTF8 without BOM


I'd like to escape \ to \\ in csv file to upload to Redshift. Following simple PowerShell script can replace $TargetWord \ to $ReplaceWord \\ , as expected, but export utf-8 with bom and sometimes causes the Redshift copy error.

Any advice would be appreciated to improve it. Thank you in advance.

Exp_Escape.ps1

Param(
    [string]$StrExpFile,
    [string]$TargetWord,
    [string]$ReplaceWord
)

# $(Get-Content "$StrExpFile").replace($TargetWord,$ReplaceWord) | Set-Content -Encoding UTF8 "$StrExpFile"

Solution

    • In PowerShell (Core) 7, you get BOM-less UTF-8 files by default; -Encoding utf8 and -Encoding utf8NoBom express that default explicitly; to excplicitly request a BOM, use -Encoding utf8BOM.

    • In Windows PowerShell, unfortunately, you must use a workaround to get BOM-less UTF-8, because -Encoding utf8 only produces UTF-8 files with BOM (and no other utf8-related values are supported).

    The workaround requires combining Out-String with New-Item, which (curiously) creates BOM-less UTF-8 files by default even in Windows PowerShell:

    Param(
        [string]$StrExpFile,
        [string]$TargetWord,
        [string]$ReplaceWord
    )
    
    $null = 
      New-Item -Force $StrExpFile -Value (
        (Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord) | Out-String
      )
    

    Note:

    • $null = is needed to discard the output object that New-Item emits (which is a file-info object describing the newly created file).

    • -Force is needed in order to quietly overwrite an existing file by the same name (as Set-Content and Out-File do by default).

    • The -Value argument must be a single (multi-line) string to write to the file, which is what Out-String ensures.[1]

    Caveats:

    • For non-string input objects, Out-String creates the same rich for-display representations as Out-File and as you would see in the console by default.

    • New-Item itself does not append a trailing newline when it writes the string to the file, but Out-String curiously does; while this happens to be handy here, it is generally problematic, as discussed in GitHub issue #14444.

    • The alternative to using Out-String is to create the multi-line string manually, which is a bit more cumbersome ("`n" is used to create LF-only newlines, which PowerShell and most programs happily accept even on Windows; for platform-native newlines (CRLF) on Windows, use [Environment]::NewLine instead):

       $null = 
         New-Item -Force $StrExpFile -Value (
           ((Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord) -join "`n`") + "`n"
         )
      
    • Since the entire file content must be passed as an argument,[2] it must fit into memory as a whole; the convenience function discussed next avoids this problem.

    For a convenience wrapper function around Out-File for use in Windows PowerShell that creates BOM-less UTF-8 files in streaming fashion, see this answer.


    Alternative, with direct use of .NET APIs:

    .NET APIs produce BOM-less UTF-8 files by default.
    However, because .NET's working directory usually differs from PowerShell's, full file paths must always be used, which requires more effort:

    # In order for .NET API calls to work as expected,
    # file paths must be expressed as *full, native* paths.
    $OutDir = Split-Path -Parent $StrExpFile
    if ($OutDir -eq '') { $OutDir = '.' }
    $strExpFileFullPath = Join-Path (Convert-Path $OutDir) (Split-Path -Leaf $StrExpFile)
    
    # Note: .NET APIs create BOM-less UTF-8 files *by default*
    [IO.File]::WriteAllLines(
      $strExpFileFullPath,
      (Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord)
    )
    

    The above uses the System.IO.File.WriteAllLines method.


    [1] Note that Out-String automatically appends a trailing newline to the string it outputs, which is actually desirable here (to ensure that the file ends with a newline, which New-Item itself doesn't do); however, in general this behavior is problematic, as discussed in GitHub issue #14444.

    [2] Note that while New-Item technically supports receiving the content to write to the file via the pipeline, it unfortunately writes each input object to the target file alone, successively, with only the last one ending up in the file.