I'm fighting what is probably an encoding issue, but I can't find it. The GitHub API is giving me three bytes instead of one for a file containing only 0xC4. Illustration:
Creating the file:
~/github-binary-api-problem(master*) » echo -n -e '\xc4' > c4-createdfromfilesystem
~/github-binary-api-problem(master*) » hexdump c4-createdfromfilesystem
0000000 c4
0000001
I committed that file to GitHub as usual - go take a look - and GitHub thinks it's a single byte:
So far so good. Now I try to download it, using the Contents API (GET /repos/{owner}/{repo}/contents/{path}):
~/github-binary-api-problem(master*) » curl \
-H "Accept: application/vnd.github.v3.raw" \
https://api.github.com/repos/Undo1/github-binary-api-problem/contents/c4-createdfromfilesystem | hexdump
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3 100 3 0 0 8 0 --:--:-- --:--:-- --:--:-- 8
0000000 ef bf bd
0000003
~/github-binary-api-problem(master*) »
And I get three bytes back! This example is in a macOS environment, but I first saw it on Windows. I'm sure it's an encoding issue somewhere in the stack, but I can't find it. What do I need to do to fetch an accurate representation of a binary file from the GitHub API?
Update - I've found that 0xef 0xbf 0xbd is the UTF-8 encoding of the replacement character (U+FFFD), so I'm guessing GitHub's API is trying to treat the file as UTF-8 text before sending it, even though raw is specified. A lone 0xC4 is not valid UTF-8, which would explain the substitution. I've sent GitHub a support ticket.
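A quick local check of that guess, as a sketch only: it assumes an iconv binary is on the PATH and uses it purely as a UTF-8 validator. It says nothing about what GitHub actually does server-side. It only shows that a lone 0xC4 byte is rejected as UTF-8, while the proper two-byte encoding of U+00C4 (C3 84) is accepted.

```shell
# 0xC4 is octal 304; octal escapes are understood by every POSIX printf.
# iconv exits non-zero when the input is not valid in the source encoding.
if printf '\304' | iconv -f UTF-8 -t UTF-16 >/dev/null 2>&1; then
  lone_c4=valid
else
  lone_c4=invalid
fi

# U+00C4 encoded as UTF-8 is the two bytes C3 84 (octal 303 204).
if printf '\303\204' | iconv -f UTF-8 -t UTF-16 >/dev/null 2>&1; then
  encoded_c4=valid
else
  encoded_c4=invalid
fi

echo "lone C4: $lone_c4"
echo "C3 84:   $encoded_c4"
```

A decoder that substitutes U+FFFD for invalid input and then re-encodes the result as UTF-8 would emit EF BF BD for the lone byte, which matches what the API returned.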
This looks like it's a genuine issue on GitHub's side. It's possible to clone a repository containing such a file and the resulting file will be correct, but viewing it in the web UI or getting it from the raw API results in a replacement character (EF BF BD).
As a workaround until your support request gets a response, request the non-raw (JSON) API instead:
$ curl https://api.github.com/repos/Undo1/github-binary-api-problem/contents/c4-createdfromfilesystem
{
"name": "c4-createdfromfilesystem",
"path": "c4-createdfromfilesystem",
"sha": "ef6080906700f3f3cdac7d60341a5de7b5da5581",
"size": 1,
"url": "https://api.github.com/repos/Undo1/github-binary-api-problem/contents/c4-createdfromfilesystem?ref=master",
"html_url": "https://github.com/Undo1/github-binary-api-problem/blob/master/c4-createdfromfilesystem",
"git_url": "https://api.github.com/repos/Undo1/github-binary-api-problem/git/blobs/ef6080906700f3f3cdac7d60341a5de7b5da5581",
"download_url": "https://raw.githubusercontent.com/Undo1/github-binary-api-problem/master/c4-createdfromfilesystem",
"type": "file",
"content": "77+9\n",
"encoding": "base64",
"_links": {
"self": "https://api.github.com/repos/Undo1/github-binary-api-problem/contents/c4-createdfromfilesystem?ref=master",
"git": "https://api.github.com/repos/Undo1/github-binary-api-problem/git/blobs/ef6080906700f3f3cdac7d60341a5de7b5da5581",
"html": "https://github.com/Undo1/github-binary-api-problem/blob/master/c4-createdfromfilesystem"
}
}
This has a base64-encoded content property, but decoding it reveals that it's EF BF BD again - a replacement character. However, given that the git repository works, it's a fair assumption that the git API might work as well - so, follow the _links.git field:
$ curl https://api.github.com/repos/Undo1/github-binary-api-problem/git/blobs/ef6080906700f3f3cdac7d60341a5de7b5da5581
{
"sha": "ef6080906700f3f3cdac7d60341a5de7b5da5581",
"node_id": "MDQ6QmxvYjI4MjA0NDU1NzplZjYwODA5MDY3MDBmM2YzY2RhYzdkNjAzNDFhNWRlN2I1ZGE1NTgx",
"size": 1,
"url": "https://api.github.com/repos/Undo1/github-binary-api-problem/git/blobs/ef6080906700f3f3cdac7d60341a5de7b5da5581",
"content": "xA==\n",
"encoding": "base64"
}
This also has a base64-encoded content field, which when decoded results in 0xC4, i.e. the correct value.
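If you want to verify the two content values yourself without hitting the network, something like the following should do it. This is a minimal sketch: b64_to_hex is a hypothetical helper name, and it assumes base64 (with --decode), od, tr and sed are available.

```shell
# Decode a base64 string and print its bytes as space-separated hex.
# (b64_to_hex is a made-up helper, not part of any tool.)
b64_to_hex() {
  printf '%s' "$1" | base64 --decode | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

contents_api=$(b64_to_hex '77+9')   # "content" from the Contents API response
blobs_api=$(b64_to_hex 'xA==')      # "content" from the Git Blobs API response

echo "contents API: $contents_api"
echo "blobs API:    $blobs_api"
```

The first value decodes to the three replacement-character bytes and the second to the single original byte, so only the blob endpoint preserves the file.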
For extra points, if you have all the right utilities installed, you can one-line this in a terminal:
curl https://api.github.com/repos/Undo1/github-binary-api-problem/contents/c4-createdfromfilesystem | jq -r '._links.git' | xargs curl | jq -r '.content' | base64 --decode