tl;dr: Is there a way to convert Cyrillic text stored in a hashtable into UTF-16 escape sequences?
Like кириллица
into \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
I need to import a file, parse each line into id and value, then convert the result to .json, and now I'm struggling to find a way to convert value into UTF-16 codes.
And yes, it is needed that way.
cyrillic.txt:
1 кириллица
PowerShell:
Clear-Host
$temp = @()  # initialize as an array so that += appends objects
foreach ($line in (Get-Content C:\Users\users\Downloads\cyrillic.txt)) {
    $nline = $line.Split(' ', 2)
    $properties = @{
        'id'    = $nline[0]  # stores "1" from file
        'value' = $nline[1]  # stores "кириллица" from file
    }
    $temp += New-Object PSObject -Property $properties
}
$temp | ConvertTo-Json | Out-File "C:\Users\user\Downloads\data.json"
Output:
[
{
"id": "1",
"value": "кириллица"
},
]
Needed:
[
{
"id": "1",
"value": "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"
},
]
At this point, as a newcomer to PowerShell, I have no idea how to even search for this properly.
Building on Jeroen Mostert's helpful comment, the following works robustly, assuming that the input file contains no NUL characters (which is usually a safe assumption for text files):
# Sample value pair; loop over file lines omitted for brevity.
$nline = '1 кириллица'.Split(' ', 2)
$properties = [ordered] @{
    id = $nline[0]
    # Insert aux. NUL characters before the 4-digit hex representations of each
    # code unit, to be removed later.
    value = -join ([uint16[]] [char[]] $nline[1]).ForEach({ "`0{0:x4}" -f $_ })
}
# Convert to JSON, then remove the escaped representations of the aux. NUL chars.,
# resulting in proper JSON escape sequences.
# Note: ... | Out-File ... omitted.
(ConvertTo-Json @($properties)) -replace '\\u0000', '\u'
Output (pipe to ConvertFrom-Json to verify that it works):
[
{
"id": "1",
"value": "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"
}
]
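For reference, the loop over the input lines that the snippet above omits could be sketched as follows, together with the ConvertFrom-Json round trip. This is only a sketch under a few assumptions: the $lines array is a hypothetical stand-in for the output of Get-Content, the sample text is built from code points so that the result doesn't depend on the script file's encoding, and the second sample line (a character above U+FFFF) is added purely for illustration.

```powershell
# Hypothetical stand-in for the lines returned by Get-Content.
# The text is built from code points to avoid script-encoding issues.
$cyrillic = -join ([char[]] (0x43a, 0x438, 0x440, 0x438, 0x43b, 0x43b, 0x438, 0x446, 0x430))
$thumbsUp = [char]::ConvertFromUtf32(0x1F44D)  # a character above U+FFFF
$lines = @("1 $cyrillic", "2 $thumbsUp")

# One ordered hashtable per line; every UTF-16 code unit of the value becomes
# NUL + 4 hex digits, using the same technique as the snippet above.
$objects = foreach ($line in $lines) {
    $nline = $line.Split(' ', 2)
    [ordered] @{
        id    = $nline[0]
        value = -join ([uint16[]] [char[]] $nline[1]).ForEach({ "`0{0:x4}" -f $_ })
    }
}

# Convert to JSON and turn the escaped aux. NUL chars. into \u prefixes.
$json = (ConvertTo-Json @($objects)) -replace '\\u0000', '\u'
$json

# Round trip: parsing the JSON should yield the original strings again.
$parsed = ConvertFrom-Json $json
@($parsed)[0].value -ceq $cyrillic
@($parsed)[1].value -ceq $thumbsUp
```

Both comparisons should evaluate to True if the escape sequences decode back to the original text. To write the file, $json can be piped to Out-File as in the question.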
Explanation:
[uint16[]] [char[]] $nline[1] converts the [char] instances of the string stored in $nline[1] into the underlying UTF-16 code units (a .NET [char] is an unsigned 16-bit integer encoding a UTF-16 code unit).
This works even with Unicode characters whose code points are above 0xFFFF, i.e. that are too large to fit into a [uint16]. Such characters outside the so-called BMP (Basic Multilingual Plane), e.g. 👍, are simply represented as pairs of UTF-16 code units, so-called surrogate pairs, which a JSON processor should recognize (ConvertFrom-Json does).
The call to the .ForEach() array method processes each resulting code unit:
"`0{0:x4}" -f $_ uses an expandable string to create a string that starts with a NUL character ("`0"), followed by a 4-digit hex representation (x4) of the code unit at hand, created via -f, the format operator.
Replacing what should be the verbatim \u prefix temporarily with a NUL character is needed, because a verbatim \ embedded in a string value would invariably be doubled in its JSON representation, given that \ acts as the escape character in JSON.
The result is something like "<NUL>043a", which ConvertTo-Json transforms as follows, given that it must escape each NUL character as \u0000:
"\u0000043a"
The result from ConvertTo-Json can then be transformed into the desired escape sequences simply by replacing \u0000 (escaped as \\u0000 for use with the regex-based -replace operator) with \u, e.g.:
"\u0000043a" -replace '\\u0000', '\u' # -> "\u043a", i.e. к