Tags: powershell, azure-devops, character-encoding, devops, invoke-restmethod

How to get Unicode data from the Azure DevOps Git Repository Get Item REST API?


I prepared the following request to get a file's content from the Azure DevOps repo item API. The file content is stored in Git in UTF-8 format, but the output of the REST API is not as expected. How can I fix the issue so that I get the content exactly as it is stored in the repo?

$uri = "http://devserver/defaultcollection/3e100875-e1dc-4aa4-a9d0-0e97af8a1634/_apis/git/repositories/f26ea979-3786-4bca-965e-0481c07ff9a9/items/Notes%2FREADME.md?versionType=Commit&version=26613c4596f233b0f48ea0f407465d941f0a4144&api-version=7.0"
$contentType  = "application/json;charset=utf-8"
$headers = @{ Authorization = "Basic $encodedPAT" }

$fileContent = Invoke-RestMethod -Uri $uri -Headers $headers -ContentType $contentType -Method Get

The output is Markdown content:

Title|Description|WorkItemID|Software|Area|Type|BuildNumber|Date
-|-|-|-|-|-|-|-
رÙع اشکا٠ÙÙاÛØ´ داد٠Ùشد٠Ùا٠ÙÙاÛØ´Û ÙدعÙÛ٠در صÙØ­Ù ÙشاÙد٠جÙسÙ|this is description|409925|Organizer||Bug|20231206.1|2023-12-06

Solution

  • tl;dr

    • Your -ContentType argument has no effect; to ask the target web service to return a JSON response - assuming it supports it - you'll need to:

      • Use an Accept header field, e.g.

         -Headers @{ Accept = 'application/json'; Authorization = "Basic $encodedPAT" }
        
      • Alternatively, if available, in the context of a GET request, use a query-string parameter to that effect as part of the URL.

    • The problem isn't specific to Azure; it is a general problem with PowerShell's web cmdlets. As detailed in the next section, Windows PowerShell and older versions of PowerShell (Core) 7+ mis-decode UTF-8 responses that aren't declared as such in the Content-Type field of the response header. This is no longer a problem in PowerShell (Core) 7.4+, which now (consistently) defaults to UTF-8.

    To ensure decoding as UTF-8, use Invoke-WebRequest rather than Invoke-RestMethod; the former's output objects have a .RawContentStream property that returns a raw byte stream that you can decode with the encoding of your choice.

    Applied to your code (as noted, only required in PowerShell versions 7.3.x and below, including in Windows PowerShell):

    $uri = "http://devserver/defaultcollection/3e100875-e1dc-4aa4-a9d0-0e97af8a1634/_apis/git/repositories/f26ea979-3786-4bca-965e-0481c07ff9a9/items/Notes%2FREADME.md?versionType=Commit&version=26613c4596f233b0f48ea0f407465d941f0a4144&api-version=7.0"
    $headers = @{ Authorization = "Basic $encodedPAT" }
    
    $fileContent = 
     [System.Text.Encoding]::UTF8.GetString(
       (
         Invoke-WebRequest -Uri $uri -Headers $headers -Method Get
       ).RawContentStream.ToArray()
     )
    

    Note the use of [System.Text.Encoding]::UTF8 to obtain a UTF-8 encoding, and its .GetString() method to convert an array of bytes to a .NET string.
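    The mis-decoding itself can be reproduced without any web request. The following is a minimal sketch, assuming a short Persian sample string (built from code points for illustration, not taken from the repository); it encodes the string as UTF-8 and then decodes the same bytes once as ISO 8859-1 and once as UTF-8:

```powershell
# Build the sample text from code points so that the encoding of the script
# file itself cannot affect it (U+0631 U+0641 U+0639; an illustrative assumption).
$original = -join ([char[]](0x0631, 0x0641, 0x0639))

# The bytes as a server would send them for UTF-8 content.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($original)

# Decoding with ISO 8859-1 maps every byte to one character, which garbles
# multi-byte UTF-8 sequences (each Persian letter here occupies two bytes).
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')
$misDecoded = $latin1.GetString($utf8Bytes)

# Decoding the same bytes as UTF-8 restores the original text.
$decoded = [System.Text.Encoding]::UTF8.GetString($utf8Bytes)

"Bytes:           $($utf8Bytes.Length)"
"Mis-decoded len: $($misDecoded.Length)"
"Round-trips:     $($decoded -ceq $original)"
```

    The mis-decoded string has one character per byte, which is why the garbled output in the question is so much longer than the original text.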


    Background information:

    • The -ContentType parameter describes the media type and, optionally, character encoding of the body (data) sent with the request, not what you'd like to receive as a response.

      • Since you're merely performing a GET request without using the -Body parameter, the -ContentType argument is effectively ignored.

      • While a header field is generally available that signals to the server what response character encoding is desired - Accept-Charset - it is rarely honored in practice.
        I presume the same applies if you use a charset parameter in the context of also requesting specific media types, via the Accept header field.

    • It is therefore the server that decides what character encoding to encode the response with and, crucially, whether or not to explicitly indicate that encoding in the Content-Type response-header field, e.g. Content-Type: text/markdown; charset=utf-8

      • Strictly speaking, the media type for Markdown text, text/markdown - assuming that it is used in the server's response - should contain a charset parameter, which PowerShell's web cmdlets do honor.

      • In the absence of such a charset parameter, it is therefore the default character encoding that applies, as used by PowerShell's web cmdlets, Invoke-WebRequest and Invoke-RestMethod.
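    To check whether a response declares its encoding, you can look for a charset parameter in the Content-Type value. The following is a minimal sketch; Get-CharsetFromContentType is a hypothetical helper name, and the simple regex is an assumption that covers common header shapes rather than the full header grammar:

```powershell
# Hypothetical helper: returns the charset parameter of a Content-Type value,
# or $null if the value does not declare one.
function Get-CharsetFromContentType {
    param(
        [Parameter(Mandatory)]
        [string] $ContentType
    )
    # -match is case-insensitive by default; the optional quote handles
    # values such as charset="utf-8".
    if ($ContentType -match 'charset\s*=\s*"?([^";\s]+)') {
        return $Matches[1]
    }
    return $null
}

# With a charset parameter, PowerShell's web cmdlets honor the declared encoding.
Get-CharsetFromContentType 'text/markdown; charset=utf-8'

# Without one, nothing is returned, and the version-specific default applies.
Get-CharsetFromContentType 'text/markdown'
```

    In a real call, the value to inspect would come from the Content-Type entry of the .Headers property of the object returned by Invoke-WebRequest.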

    The default character encoding used by the Invoke-WebRequest and Invoke-RestMethod cmdlets depends on the PowerShell edition and version, as shown in the following table:

    Edition            | Version                                | Default
    -------------------|----------------------------------------|------------------------------------------------------------------------------
    Windows PowerShell | up to 5.1, the latest and last version | ISO 8859-1[1]
    PowerShell (Core)  | 7.0 - 7.3.x                            | ISO 8859-1, except for application/json responses,[2] which default to UTF-8
    PowerShell (Core)  | 7.4 and above                          | UTF-8

    • This default encoding not only applies to decoding responses, but also to encoding request data, namely when you pass a string to the -Body parameter (you may alternatively pass arbitrary [byte] arrays); you can override this with a charset parameter in the -ContentType argument, e.g.:
      -ContentType 'application/json; charset=utf-8'

    • If, in a given call, the response body gets mis-decoded due to the above-mentioned defaults, you need to manually decode the raw bytes, as shown in the top section.
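    On the request side, passing a byte array to -Body sidesteps the default encoding altogether. The following is a minimal sketch, assuming a small JSON payload invented for illustration; it shows that the same string produces different bytes under UTF-8 and ISO 8859-1, and the commented-out call uses a hypothetical $uri:

```powershell
# Illustrative JSON payload containing one non-ASCII character (U+00E9),
# built from a code point so the script file's encoding does not matter.
$json = '{"title":"caf' + [char]0x00E9 + '"}'

# Encode the body explicitly as UTF-8 ...
$utf8Body = [System.Text.Encoding]::UTF8.GetBytes($json)

# ... and, for comparison, as ISO 8859-1, the default in older versions.
$latin1Body = [System.Text.Encoding]::GetEncoding('iso-8859-1').GetBytes($json)

# U+00E9 takes two bytes in UTF-8 but only one in ISO 8859-1,
# so the UTF-8 body is one byte longer.
"UTF-8 bytes:      $($utf8Body.Length)"
"ISO 8859-1 bytes: $($latin1Body.Length)"

# Sending the pre-encoded bytes ($uri is a hypothetical endpoint):
# Invoke-RestMethod -Uri $uri -Method Post -Body $utf8Body -ContentType 'application/json; charset=utf-8'
```

    Declaring charset=utf-8 in -ContentType alongside the byte array keeps the declared and actual encodings in agreement.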


    [1] This encoding is largely identical to Windows-1252, except that the following characters are missing, notably including the euro sign (€):
    € ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ

    [2] Note that request JSON data passed as a string to the -Body parameter is, curiously, still encoded as ISO 8859-1 by default, an inconsistency that was resolved in v7.4.