Tags: json, powershell, unicode

How can I make sure PowerShell correctly renders all non-Latin characters in the output?


I'm trying to reduce the size of a jsonl file by removing irrelevant information.

The jsonl file is a Wiktionary download. I'm interested only in those entries which are personal names (i.e. first names and last names).

You can download the file (it's 1.3 GB) here: wiktionary

The issue is that many of the entries are both. For example, France is a personal name, and is also the name of a country.

So I need to read in the file and write it back out with the irrelevant info removed. However, despite using

$json | ConvertTo-Json -Compress | Out-File -Append -FilePath $outputFile -Encoding utf8

it does not render all the characters properly: it turns them into \uXXXX escape sequences such as \u0000, or garbles them outright. The input is from Wiktionary, so it contains every script from every language you could think of.

How can I change this script so that it renders all the characters correctly? Alternatively, how can I make it delete text from the original file while leaving everything that doesn't need to be deleted untouched, instead of re-rendering it?

# Initialize the output file
New-Item -Path $outputFile -ItemType File -Force

# Read the input file line by line
Get-Content $inputFile | ForEach-Object {
    # Parse the JSON content
    try {
        $json = $_ | ConvertFrom-Json
    } catch {
        Write-Host "Failed to parse JSON: $_"
        return
    }

    # Filter the senses
    $filteredSenses = $json.senses | Where-Object { 
        foreach ($link in $_.links) {
            if ($link[0] -eq "given name" -or $link[0] -eq "surname") {
                return $true
            }
        }
        return $false
    }

    # If there are filtered senses, update the JSON object and write to the output file
    if ($filteredSenses) {
        $json.senses = $filteredSenses
        $json | ConvertTo-Json -Compress | Out-File -Append -FilePath $outputFile -Encoding utf8
    }
}

Write-Host "Filtered data has been written to $outputFile"

Solution

  • If you're using Windows PowerShell (the legacy, ships-with-Windows, Windows-only edition of PowerShell whose latest and last version is 5.1) and your $inputFile's encoding is BOM-less UTF-8, you'll need to pass -Encoding utf8 to your Get-Content call to ensure that the file is interpreted correctly.

    The reason is that Get-Content in Windows PowerShell, in the absence of a BOM, defaults to the legacy system locale's ANSI code page (character encoding) and therefore misinterprets BOM-less UTF-8 files.

    Also note that using -Encoding utf8 with file-writing cmdlets in Windows PowerShell, such as Set-Content and Out-File (whose virtual alias is >), invariably creates UTF-8 files with a BOM, and workarounds are needed to create BOM-less UTF-8 files.

    Note that neither problem affects PowerShell (Core) 7 (the modern, cross-platform, install-on-demand edition), which commendably now consistently defaults to (BOM-less) UTF-8.

    For more information, see this answer.
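Putting both points together, here is a sketch of the question's loop with the fixes applied, for Windows PowerShell. The file paths are placeholders for your own. It reads the input explicitly as UTF-8, and writes BOM-less UTF-8 via a .NET `StreamWriter` (working around `Out-File -Encoding utf8` always emitting a BOM in Windows PowerShell):

```powershell
# Placeholder paths - substitute your own.
$inputFile  = 'raw-wiktextract-data.jsonl'
$outputFile = 'names-only.jsonl'

# A UTF8Encoding constructed with $false emits no BOM, unlike
# Out-File -Encoding utf8 in Windows PowerShell.
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)
$writer = [System.IO.StreamWriter]::new($outputFile, $false, $utf8NoBom)

try {
    # -Encoding utf8 ensures BOM-less UTF-8 input is decoded correctly
    # in Windows PowerShell; PowerShell 7 already defaults to UTF-8.
    Get-Content $inputFile -Encoding utf8 | ForEach-Object {
        try {
            $json = $_ | ConvertFrom-Json
        } catch {
            Write-Host "Failed to parse JSON: $_"
            return
        }

        # @() keeps the result an array even when only one sense matches.
        $filteredSenses = @($json.senses | Where-Object {
            foreach ($link in $_.links) {
                if ($link[0] -eq 'given name' -or $link[0] -eq 'surname') {
                    return $true
                }
            }
            return $false
        })

        if ($filteredSenses) {
            $json.senses = $filteredSenses
            # -Depth guards against ConvertTo-Json's default depth of 2
            # silently flattening deeply nested objects.
            $writer.WriteLine(($json | ConvertTo-Json -Compress -Depth 10))
        }
    }
} finally {
    $writer.Close()
}
```

Note that the `\uXXXX` sequences themselves are valid JSON escapes that any conforming parser will decode back to the original characters, so they indicate escaping rather than data loss; Windows PowerShell's `ConvertTo-Json` always escapes non-ASCII characters this way, whereas PowerShell 7.2+ emits them literally by default.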