Search code examples
powershellencodingutf-8powershell-5.0byte-order-mark

Modify the output of this script so that it is read as BOM-less UTF-8


I have several .csv files in a folder with the second column empty and I would like to fill it with the data present in as many csv files with the same name+column3.

Example:

firstfile.csv

header1,translation,source
first,,"third"
one,,three

firstfile_column3.txt (name of the file+_column3.txt)

source
second, fifth
two

firstfileoutput.csv (source became translation)

header1,translation,source
first,"second, fifth","third"
one,two,three

I was able to get it to work, but the accents and Asian characters are wrong. I am using Powershell 5 on Windows and should modify the output so that it is read as BOM-less UTF-8. Trying to edit the last few strings still can't solve it.

param(
    $SourceDir = $PWD,
    $OutDir = $PWD,
    $OutFileSuffix = "output" # Define the suffix for the output file.
)

# Get all primary CSV files in the source directory.
$csvFiles = Get-ChildItem -Path $SourceDir -Recurse -Filter "*.csv"

foreach ($csvFile in $csvFiles) {
    # Construct the name for the corresponding _column3 file.
    $column3FileName = "{0}_column3.txt" -f $csvFile.BaseName
    $column3FilePath = Join-Path -Path $SourceDir -ChildPath $column3FileName
    
    # Check if the _column3 file exists.
    if (Test-Path $column3FilePath) {
        # Import the primary CSV file and the corresponding _column3 file.
        $primaryCsv = Import-Csv -Path $csvFile.FullName
        $column3Data = Get-Content $column3FilePath
        
        # Assuming the first line in the _column3 file is a header and we skip it.
        $column3Values = $column3Data | Select-Object -Skip 1

        # Update the second column (translation) in the primary CSV with data from the _column3 file.
        for ($i = 0; $i -lt $primaryCsv.Count; $i++) {
            $primaryCsv[$i].translation = $column3Values[$i]
        }

        # Construct the output file path.
        $outputFilePath = Join-Path -Path $csvFile.DirectoryName -ChildPath ("{0}{1}.csv" -f $csvFile.BaseName, $OutFileSuffix)

        # Export the updated CSV data to a new file.
        $primaryCsv | Export-Csv -Path $outputFilePath -NoTypeInformation -Encoding UTF8
    }
    else {
        Write-Warning "Corresponding column3 file not found for $($csvFile.Name)"
    }
}

Solution

  • To ensure consistent, BOM-less UTF-8 handling in Windows PowerShell:

    • On reading:

      • Use -Encoding utf8 to ensure that BOM-less UTF-8 files are read as such:

        • Import-Csv -Encoding utf8 $csvFile.FullName
        • Get-Content -Encoding utf8 $column3FilePath
    • On writing:

      • Avoid -Encoding utf8, because it creates files with BOM.

      • A non-obvious workaround for your Export-Csv call is needed, which relies on the fact that New-Item does create BOM-less UTF-8 files, by default and invariably:

         $null = New-Item -Force $outputFilePath -Value (
           $primaryCsv | ConvertTo-Csv -NoTypeInformation | Out-String
         )
        

    Note that none of these things are necessary in PowerShell (Core) 7+, which consistently defaults to (BOM-less) UTF-8, across all built-in cmdlets (as well as when reading source code from files).


    Background information:
    • Unfortunately, Windows PowerShell and its file-reading cmdlets exhibit inconsistent behavior with respect to what character encoding to assume in the absence of a BOM.

      • In your code Import-Csv happens to assume UTF-8, ...

      • ... while Get-Content assumes ANSI, i.e. the system's active legacy ANSI code page, as does the PowerShell engine itself when it reads source code.

      • Similarly, there's no consistence when writing to files: Set-Content uses ANSI encoding, whereas Out-File (and its virtual aliases, > and >>)

    • By contrast, PowerShell (Core) 7+ and its cmdlets - both reading and writing ones - now consistently default to (BOM-less) UTF-8


    The upshot is:

    • In Windows PowerShell, to be safe, always use -Encoding utf8 if you know you're dealing with UTF-8-encoded files - whether they do or do not have a BOM (in the former case, using -Encoding utf8 is technically redundant, but causes no harm).

      • By contrast, this is no longer required in PowerShell (Core) 7+, which in the absence of a BOM assumes UTF-8.
    • The same applies if you want to write to files using UTF-8, but note that in Windows PowerShell -Encoding utf8 invariably creates UTF-8 files with a BOM:

      • See this answer for plain-text workarounds (the New-Item workaround is shown above).

      • By contrast, PowerShell (Core) 7+ creates BOM-less UTF-8 files both by default and with -Encoding utf8; in case you do want a BOM there, use -Encoding utf8BOM