Tags: json, powershell, utf-8, character-encoding, invoke-webrequest

JSON rejected for invalid UTF-8 start byte 0xa0, but encoding appears valid


I'm creating a JSON file in PowerShell 7.4 to send to a 3rd party REST endpoint. Out-File defaults to UTF-8, and when I check the file in Notepad++, the encoding setting appears as UTF-8. Unfortunately, the POST is being rejected with the message:

400 Bad Request
JSON parse error
Nested exception is com.fasterxml.jackson.databind.JsonMappingException: Invalid UTF-8 start byte 0xa0\n at line: 9259, column: 38

I examined the JSON line specified in the error message. The source JSON file has a NO-BREAK SPACE sequence in its company name as follows:

String        Hex
------------  --------------------------------------
"Acme, Inc."  22 41 63 6d 65 2c c2 a0 49 6e 63 2e 22

NO-BREAK SPACE in UTF-8 appears as two bytes: 0xc2 0xa0. Both bytes are present in the JSON file, but the error indicates that the remote parser isn't processing the first byte as part of the sequence.
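The two-byte claim can be checked directly in PowerShell. This is only a minimal sketch; the variable names `$nbsp` and `$utf8Hex` are made up for illustration and are not part of the script below.

```powershell
# NO-BREAK SPACE (U+00A0) as a one-character .NET string
$nbsp = [string][char]0x00A0

# Encode it as UTF-8 and format each resulting byte as two-digit hex
$utf8Hex = ([System.Text.Encoding]::UTF8.GetBytes($nbsp) |
    ForEach-Object { $_.ToString('x2') }) -join ' '

$utf8Hex   # c2 a0
```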

Here's the PowerShell script:

# identify CSV file

   $csvFile = Get-ChildItem -Path ($path + '*.csv') -File | 
                 Sort-Object LastWriteTime | 
                 Select-Object -First 1 

# suppress blank lines

   $objData = Get-Content $csvFile -Encoding UTF8 | 
                 Where-Object { $_ } | 
                 ConvertFrom-CSV

# convert to JSON and save to file
     
   $body = $objData | 
              ConvertTo-Json -Depth 100

   $body | 
      Out-File ( $path + 'data.json')
        
# post JSON

    $webParam = @{
       Uri         = $url 
       Method      = 'POST' 
       Headers     =  @{ 'Authorization' = $auth
                         'Cache-Control' = 'no-cache' }
       Body        = $body 
       ContentType = 'application/json'
    }
  
$apiResponse = Invoke-WebRequest @webParam

The data is usually different each time the script runs. On most occasions, the remote site will accept the JSON without an issue because it doesn't have any oddball Unicode characters.

I'm not sure why the remote site doesn't like the string, but the error makes sense if it can't distinguish the entire two-byte sequence. PowerShell's Test-Json cmdlet always evaluates as true before I send. Has anyone encountered this before?


Solution

  • To ensure that PowerShell also uses UTF-8 encoding in versions 7.3.x and below (including Windows PowerShell) when it transmits the .NET string passed to the -Body parameter of Invoke-WebRequest, use
    -ContentType 'application/json; charset=utf-8' (in PowerShell 7.4+, this is no longer necessary). Applied to your splatting scenario:

        $webParam = @{
           Uri         = $url 
           Method      = 'POST' 
           Headers     =  @{ 'Authorization' = $auth
                             'Cache-Control' = 'no-cache' }
           Body        = $body 
           # Note the addition of '; charset=utf-8'
           ContentType = 'application/json; charset=utf-8'
        }
      
    $apiResponse = Invoke-WebRequest @webParam
    
    • That you originally read your text from a UTF-8 file is irrelevant in this case, because by using a .NET string you're delegating the decision as to what character encoding to use to Invoke-WebRequest, and in versions before 7.4 that encoding is ISO-8859-1.

      • In this single-byte encoding, the NO-BREAK SPACE (U+00A0) character is the single byte 0xa0, which is not a valid start byte in UTF-8 (it can only appear as a continuation byte inside a multi-byte sequence) - this is what your target server complained about.

      • In fact, because ISO-8859-1 forms the 8-bit subrange of Unicode and because of how the UTF-8 Unicode encoding works, all Unicode code points in the range 0x80 - 0xbf are encoded in UTF-8 as two-byte sequences whose first byte is 0xc2, followed by a byte with the same value as the code point.

      • Thus, NO-BREAK SPACE (U+00A0) - the character whose code point is 0xa0 in both Unicode (abstractly) and ISO-8859-1 (as a concrete, single-byte value) - is the two-byte sequence 0xc2 0xa0 in UTF-8.
        Because, due to mistaken ISO-8859-1 encoding, only 0xa0 was transmitted, it appeared as if the target server ignored the 0xc2, but in actuality it never received it.

    • See this answer for additional information about PowerShell's behavior (which equally applies to the Invoke-RestMethod cmdlet).
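
To make the byte-level difference concrete, here is a rough sketch that encodes the company name from the question both ways and then tries to decode the ISO-8859-1 bytes as strict UTF-8. The helper name `ConvertTo-HexString` and all variable names are hypothetical, chosen only for this illustration; the sketch runs locally and does not reproduce what Invoke-WebRequest does internally.

```powershell
# Sample value from the question, with U+00A0 between "Acme," and "Inc."
$text = "Acme,$([char]0x00A0)Inc."

$utf8   = [System.Text.Encoding]::UTF8
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')

# Hypothetical helper: render a byte array as space-separated hex
function ConvertTo-HexString([byte[]] $Bytes) {
    ($Bytes | ForEach-Object { $_.ToString('x2') }) -join ' '
}

$utf8Hex   = ConvertTo-HexString $utf8.GetBytes($text)
$latin1Hex = ConvertTo-HexString $latin1.GetBytes($text)

$utf8Hex    # 41 63 6d 65 2c c2 a0 49 6e 63 2e
$latin1Hex  # 41 63 6d 65 2c a0 49 6e 63 2e

# A strict UTF-8 decoder (throwOnInvalidBytes = $true) rejects the lone 0xa0,
# which is roughly the situation the server-side JSON parser was in.
$strictUtf8   = [System.Text.UTF8Encoding]::new($false, $true)
$decodeFailed = $false
try   { $null = $strictUtf8.GetString($latin1.GetBytes($text)) }
catch { $decodeFailed = $true }

$decodeFailed  # True
```

The only difference between the two hex strings is the missing 0xc2, which matches the error: the server received 0xa0 where a UTF-8 start byte was expected.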