git github github-api binaryfiles

GitHub API returns three bytes for a single-byte ("0xC4") binary file


I'm fighting what is probably an encoding issue, but I can't find where it comes from. The GitHub API is giving me three bytes instead of one for a file containing only the byte 0xC4. Illustration:

Creating the file:

~/github-binary-api-problem(master*) » echo -n -e '\xc4' > c4-createdfromfilesystem
~/github-binary-api-problem(master*) » hexdump c4-createdfromfilesystem
0000000 c4
0000001

I committed that file to GitHub as usual - go take a look - and GitHub thinks it's a single byte:

[Screenshot: GitHub showing "1 lines (1 sloc) / 1 Byte"]

So far so good. Now I try to download it, using the Contents API (GET /repos/{owner}/{repo}/contents/{path}):

~/github-binary-api-problem(master*) » curl \
-H "Accept: application/vnd.github.v3.raw" \
https://api.github.com/repos/Undo1/github-binary-api-problem/contents/c4-createdfromfilesystem | hexdump
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     3  100     3    0     0      8      0 --:--:-- --:--:-- --:--:--     8
0000000 ef bf bd
0000003
~/github-binary-api-problem(master*) »

And I get three bytes back! This example is in a macOS environment, but I first saw it on Windows. I'm sure it's an encoding issue somewhere in the stack, but I can't find it. What do I need to do to fetch an accurate representation of a binary file from the GitHub API?


Update - I've found that 0xef 0xbf 0xbd is the UTF-8 encoding of the Unicode replacement character (U+FFFD), so I'm guessing GitHub's API is trying to interpret the file as UTF-8 text before sending it, even though raw is specified. I've sent GitHub a support ticket.
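
As a quick sanity check of that guess, here is a small sketch in plain POSIX shell arithmetic. The function names utf8_3byte and is_two_byte_lead are made up for illustration. It encodes U+FFFD by hand and looks at the bit pattern of 0xC4. The result is consistent with a decoder that substitutes U+FFFD for invalid input, though it says nothing about what GitHub actually does internally.

```shell
#!/bin/sh
# Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes
# (layout: 1110xxxx 10xxxxxx 10xxxxxx) and print them as hex.
utf8_3byte() {
  cp=$1
  printf '%02x %02x %02x\n' \
    $(( 0xE0 | (cp >> 12) )) \
    $(( 0x80 | ((cp >> 6) & 0x3F) )) \
    $(( 0x80 | (cp & 0x3F) ))
}

# True if the byte has the 110xxxxx pattern, i.e. it claims to start a
# two-byte sequence and must be followed by a continuation byte.
is_two_byte_lead() {
  [ $(( $1 & 0xE0 )) -eq $(( 0xC0 )) ]
}

utf8_3byte $((0xFFFD))        # prints: ef bf bd
is_two_byte_lead $((0xC4)) && echo "0xC4 alone is an incomplete UTF-8 sequence"
```

So a file holding only 0xC4 is not valid UTF-8 on its own, and ef bf bd is exactly what comes out when that byte is replaced with U+FFFD and re-encoded.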


Solution

  • This looks like it's a genuine issue on GitHub's side. It's possible to clone a repository containing such a file and the resulting file will be correct, but viewing it in the web UI or getting it from the raw API results in a replacement character (EF BF BD).

    As a workaround until your support request gets a response, request the non-raw (JSON) API instead:

    $ curl https://api.github.com/repos/Undo1/github-binary-api-problem/contents/c4-createdfromfilesystem
    {
      "name": "c4-createdfromfilesystem",
      "path": "c4-createdfromfilesystem",
      "sha": "ef6080906700f3f3cdac7d60341a5de7b5da5581",
      "size": 1,
      "url": "https://api.github.com/repos/Undo1/github-binary-api-problem/contents/c4-createdfromfilesystem?ref=master",
      "html_url": "https://github.com/Undo1/github-binary-api-problem/blob/master/c4-createdfromfilesystem",
      "git_url": "https://api.github.com/repos/Undo1/github-binary-api-problem/git/blobs/ef6080906700f3f3cdac7d60341a5de7b5da5581",
      "download_url": "https://raw.githubusercontent.com/Undo1/github-binary-api-problem/master/c4-createdfromfilesystem",
      "type": "file",
      "content": "77+9\n",
      "encoding": "base64",
      "_links": {
        "self": "https://api.github.com/repos/Undo1/github-binary-api-problem/contents/c4-createdfromfilesystem?ref=master",
        "git": "https://api.github.com/repos/Undo1/github-binary-api-problem/git/blobs/ef6080906700f3f3cdac7d60341a5de7b5da5581",
        "html": "https://github.com/Undo1/github-binary-api-problem/blob/master/c4-createdfromfilesystem"
      }
    }
    

    This has a base64-encoded content property, but decoding it reveals that it's EF BF BD again - a replacement character. However, given that the git repository works, it's a fair assumption that the git API might work as well - so, follow the _links.git field:

    $ curl https://api.github.com/repos/Undo1/github-binary-api-problem/git/blobs/ef6080906700f3f3cdac7d60341a5de7b5da5581
    {
      "sha": "ef6080906700f3f3cdac7d60341a5de7b5da5581",
      "node_id": "MDQ6QmxvYjI4MjA0NDU1NzplZjYwODA5MDY3MDBmM2YzY2RhYzdkNjAzNDFhNWRlN2I1ZGE1NTgx",
      "size": 1,
      "url": "https://api.github.com/repos/Undo1/github-binary-api-problem/git/blobs/ef6080906700f3f3cdac7d60341a5de7b5da5581",
      "content": "xA==\n",
      "encoding": "base64"
    }
    

    This also has a base64-encoded content field, which when decoded results in 0xC4, i.e. the correct value.


    For extra points, if you have all the right utilities installed, you can one-line this in a terminal:

    curl -s https://api.github.com/repos/Undo1/github-binary-api-problem/contents/c4-createdfromfilesystem | jq -r '._links.git' | xargs curl -s | jq -r '.content' | base64 --decode
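
If you want to check the two content values yourself, or wrap the two-step lookup in something reusable, the following is a rough sketch rather than a polished tool. It assumes curl, jq, base64 and od are installed. The names b64_to_hex, fetch_json and github_blob_bytes are hypothetical helpers invented for this example; fetch_json exists only so you can swap in authentication headers or an offline stub.

```shell
#!/bin/sh
# Decode a base64 string and print the resulting bytes as hex.
b64_to_hex() {
  printf '%s' "$1" | base64 --decode | od -An -tx1 | tr -d ' \n'
  echo
}

b64_to_hex '77+9'   # content from the contents API, prints: efbfbd
b64_to_hex 'xA=='   # content from the git blobs API, prints: c4

# Hypothetical helper: fetch a URL and write the JSON body to stdout.
fetch_json() {
  curl -fsS "$1"
}

# Usage: github_blob_bytes OWNER REPO PATH > output-file
github_blob_bytes() {
  contents_url="https://api.github.com/repos/$1/$2/contents/$3"
  # Step 1: the contents response carries the blob URL in _links.git.
  blob_url=$(fetch_json "$contents_url" | jq -r '._links.git')
  if [ -z "$blob_url" ] || [ "$blob_url" = "null" ]; then
    echo "no _links.git in response for $contents_url" >&2
    return 1
  fi
  # Step 2: the blob response carries the base64 of the stored bytes.
  fetch_json "$blob_url" | jq -r '.content' | base64 --decode
}
```

For the file in the question, that would be called as `github_blob_bytes Undo1 github-binary-api-problem c4-createdfromfilesystem | hexdump`.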