I'm trying to reduce the size of a JSONL file by removing irrelevant information.
The JSONL file is a Wiktionary download. I'm interested only in those entries which are personal names (i.e. first names and last names).
You can download the file (it's 1.3 GB) here: wiktionary
The issue is that many of the entries are both. For example, France is a given name, and can also be the name of a country.
So I need to read the file in and write it back out with the irrelevant info removed. However, despite using
$json | ConvertTo-Json -Compress | Out-File -Append -FilePath $outputFile -Encoding utf8
it does not render all the characters properly: it turns them into \u0000-style escapes or just gets them wrong. The input is from Wiktionary, so it contains every script from every language you could think of.
How can I change this script so that it renders all the characters correctly? Alternatively, how can I make it delete things from the original file while leaving everything that doesn't need to be deleted untouched, instead of re-rendering it?
# Initialize the output file
New-Item -Path $outputFile -ItemType File -Force

# Read the input file line by line
Get-Content $inputFile | ForEach-Object {
    # Parse the JSON content
    try {
        $json = $_ | ConvertFrom-Json
    } catch {
        Write-Host "Failed to parse JSON: $_"
        return
    }

    # Filter the senses
    $filteredSenses = $json.senses | Where-Object {
        foreach ($link in $_.links) {
            if ($link[0] -eq "given name" -or $link[0] -eq "surname") {
                return $true
            }
        }
        return $false
    }

    # If there are filtered senses, update the JSON object and write to the output file
    if ($filteredSenses) {
        $json.senses = $filteredSenses
        $json | ConvertTo-Json -Compress | Out-File -Append -FilePath $outputFile -Encoding utf8
    }
}

Write-Host "Filtered data has been written to $outputFile"
If you're using Windows PowerShell (the legacy, ships-with-Windows, Windows-only edition of PowerShell whose latest and last version is 5.1) and your $inputFile's encoding is BOM-less UTF-8, you'll need to pass -Encoding utf8 to your Get-Content call to ensure that the file is interpreted correctly.
The reason is that Get-Content in Windows PowerShell, in the absence of a BOM, defaults to the legacy system locale's ANSI code page (character encoding) and therefore misinterprets BOM-less UTF-8 files.
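For instance, the Get-Content call in the script above could be made encoding-explicit like this (a sketch; $inputFile is assumed to hold the path to your BOM-less UTF-8 JSONL file):

```powershell
# Force Windows PowerShell 5.1 to decode the input as UTF-8 instead of the
# system's ANSI code page; without this, non-ASCII characters are mangled
# before ConvertFrom-Json ever sees them.
Get-Content $inputFile -Encoding utf8 | ForEach-Object {
    # ... process each JSONL line as before ...
}
```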
Also note that using -Encoding utf8 with file-writing cmdlets in Windows PowerShell, such as Set-Content and Out-File (whose virtual alias is >), invariably creates UTF-8 files with a BOM, and workarounds are needed to create BOM-less UTF-8 files.
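One such workaround is to bypass Out-File and write through the .NET APIs directly, which let you request UTF-8 without a BOM. A minimal sketch (assuming $outputFile holds a full, resolved path, since .NET does not honor PowerShell's current location, and $json is the parsed object from the script above):

```powershell
# UTF8Encoding with $false means: do not emit a BOM.
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)

# Second constructor argument $true means: append to the file.
$writer = [System.IO.StreamWriter]::new($outputFile, $true, $utf8NoBom)
try {
    $writer.WriteLine(($json | ConvertTo-Json -Compress))
} finally {
    $writer.Dispose()
}
```

For a large loop such as the one above, opening the StreamWriter once before the loop and disposing of it afterwards is also much faster than appending line by line with Out-File.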
Note that neither problem affects PowerShell (Core) 7 (the modern, cross-platform, install-on-demand edition), which commendably now consistently defaults to (BOM-less) UTF-8.
For more information, see this answer.