json powershell escaping utf-16

How to convert Cyrillic into UTF-16


tl;dr: Is there a way to convert Cyrillic text stored in a hashtable into UTF-16 escape sequences, e.g. кириллица into \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430?

I need to import a file, parse each line into an id and a value, and then convert the result to .json. Now I'm struggling to find a way to convert the value into UTF-16 escape codes.

And yes, it is needed that way.

cyrillic.txt:

1 кириллица

PowerShell:

clear-host
$temp = @() # start with an empty array so that += appends each object
foreach ($line in (Get-Content C:\Users\users\Downloads\cyrillic.txt)){
    $nline = $line.Split(' ', 2)
    $properties = @{
        'id'= $nline[0] #stores "1" from file
        'value'=$nline[1] #stores "кириллица" from file
    }
    $temp+=New-Object PSObject -Property $properties
}
$temp | ConvertTo-Json | Out-File "C:\Users\user\Downloads\data.json"

Output:

[
    {
        "id":  "1",
        "value":  "кириллица"
    }
]

Needed:

[
    {
        "id":  "1",
        "value":  "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"
    }
]

At this point, as a newcomer to PowerShell, I have no idea even how to search for this properly.


Solution

  • Building on Jeroen Mostert's helpful comment, the following works robustly, assuming that the input file contains no NUL characters (which is usually a safe assumption for text files):

    # Sample value pair; loop over file lines omitted for brevity.
    $nline = '1 кириллица'.Split(' ', 2)
    
    $properties = [ordered] @{
      id = $nline[0]
      # Insert aux. NUL characters before the 4-digit hex representations of each
      # code unit, to be removed later.
      value = -join ([uint16[]] [char[]] $nline[1]).ForEach({ "`0{0:x4}" -f $_ })
    }
    
    # Convert to JSON, then remove the escaped representations of the aux. NUL characters,
    # resulting in proper JSON escape sequences.
    # Note: ... | Out-File ... omitted.
    (ConvertTo-Json @($properties)) -replace '\\u0000', '\u'
    

    Output (pipe to ConvertFrom-Json to verify that it works):

    [
      {
        "id": "1",
        "value": "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"
      }
    ]
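
The sample above handles a single value pair and leaves out the loop over the file lines. A minimal sketch of the whole pipeline, assuming every input line has the form "<id> <value>", might look like the following. The helper name ConvertTo-NulMarkedHex and the inline $lines array (standing in for the output of Get-Content) are illustrative assumptions, not part of the original answer:

```powershell
# Hypothetical helper (the name is an assumption): returns the UTF-16 code units of
# $Text as 4-digit hex strings, each prefixed with an auxiliary NUL character.
function ConvertTo-NulMarkedHex {
  param([string] $Text)
  -join ([uint16[]] [char[]] $Text).ForEach({ "`0{0:x4}" -f $_ })
}

# Stand-in for: Get-Content <path to cyrillic.txt>
$lines = '1 кириллица', '2 привет'

# Collect the loop's output directly instead of appending with +=.
$objects = foreach ($line in $lines) {
  $id, $value = $line.Split(' ', 2)
  [pscustomobject] @{
    id    = $id
    value = ConvertTo-NulMarkedHex $value
  }
}

# Serialize, then turn each escaped aux. NUL (\u0000) into a \u prefix.
$json = (ConvertTo-Json @($objects)) -replace '\\u0000', '\u'
$json
```

To write the result to a file, $json can be piped to Out-File as in the question. Because the resulting JSON text should contain ASCII characters only, the output file's encoding should not affect how the Cyrillic values are read back.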
    

    Explanation:

    • [uint16[]] [char[]] $nline[1] converts the [char] instances of the string stored in $nline[1] into their numeric values (a .NET [char] is an unsigned 16-bit integer that holds one UTF-16 code unit, not necessarily a whole Unicode code point).

      • Note that this works even with Unicode characters that have code points above 0xFFFF, i.e. that are too large to fit into a [uint16]. Such characters outside the so-called BMP (Basic Multilingual Plane), e.g. 👍, are simply represented as pairs of UTF-16 code units, so-called surrogate pairs, which a JSON processor should recognize (ConvertFrom-Json does).
      • However, on Windows such characters may not render correctly, depending on your console window's font. The safest option is to use Windows Terminal, which is available in the Microsoft Store.
    • The call to the .ForEach() array method processes each resulting code unit:

      • "`0{0:x4}" -f $_ uses an expandable string to create a string that starts with a NUL character ("`0"), followed by a 4-digit hex. representation (x4) of the code unit at hand, created via -f, the format operator.

        • This trick of temporarily replacing what should ultimately be a verbatim \u prefix with a NUL character is needed because a verbatim \ embedded in a string value would invariably be doubled in its JSON representation, given that \ acts as the escape character in JSON.
      • The result is something like "<NUL>043a", which ConvertTo-Json transforms as follows, given that it must escape each NUL character as \u0000:

        "\u0000043a"
        
    • The result from ConvertTo-Json can then be transformed into the desired escape sequences simply by replacing \u0000 (escaped as \\u0000 for use with the regex-based -replace operator) with \u, e.g.:

        "\u0000043a" -replace '\\u0000', '\u' # -> "\u043a", i.e. к