Tags: python-3.x, google-cloud-platform, urllib, google-container-registry, httpx

How does urllib.request differ from curl or httpx in behaviour? Getting a 401 in a request to the Google Container Registry


I am working on code that interacts with images on the Google Container Registry (GCR). I have working versions using plain curl and using httpx, but I am trying to build a package without third-party dependencies. My question concerns one particular endpoint: it returns a successful response with curl and httpx, but 401 Unauthorized with urllib.request.

The bash script below demonstrates what I am trying to achieve. It retrieves an access token from the registry API, uses that token to verify that the API indeed runs version 2, and then tries to access a particular Docker image configuration. To test this, I'm afraid you will need access to a private GCR image and a digest for one of its tags.

#!/usr/bin/env bash

# Also fail when curl fails inside the "curl | jq" pipeline.
set -euo pipefail

token=$(gcloud auth print-access-token)
image=...
digest=sha256:...

get_token() {
    curl -sSL \
        -G \
        --http1.1 \
        -H "Authorization: Bearer ${token}" \
        -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
        --data-urlencode "scope=repository:$1:pull" \
        --data-urlencode "service=gcr.io" \
        "https://gcr.io/v2/token" | jq -r '.token'
}

echo "---"
echo "Retrieving access token."
access_token=$(get_token "${image}")

echo
echo "---"
echo "Testing version 2 capability with access token."
curl -sSL \
    --http1.1 \
    -o /dev/null \
    -w "%{http_code}" \
    -H "Authorization: Bearer ${access_token}" \
    -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
    https://gcr.io/v2/

echo
echo "---"
echo "Retrieving image configuration with access token."
curl -vL \
    --http1.1 \
    -o /dev/null \
    -w "%{http_code}" \
    -H "Authorization: Bearer ${access_token}" \
    -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
    "https://gcr.io/v2/${image}/blobs/${digest}"

I additionally created two Jupyter notebooks demonstrating my solutions with httpx and with bare urllib.request. The httpx one works perfectly, while the urllib one fails on the image configuration request. I am running out of ideas for spotting the difference. If you run the notebook yourself, you will see that the URL that is finally called contains a token as a query parameter (is this a security issue?). When I open that link directly, I can download the data successfully. Maybe urllib still passes the Authorization header with the Bearer token along, making that last call fail with 401 Unauthorized?
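
For readers without access to the notebooks, the first step of the bash script can be sketched with the standard library alone. This is not the notebook's code: the helper names build_token_request and fetch_access_token are hypothetical, while the endpoint and query parameters are copied from the script above.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(image, gcloud_token):
    """Build (but do not send) the GET request for a registry pull token."""
    # Same query parameters as the --data-urlencode options in the bash script.
    query = urllib.parse.urlencode(
        {"scope": f"repository:{image}:pull", "service": "gcr.io"}
    )
    request = urllib.request.Request(f"https://gcr.io/v2/token?{query}")
    request.add_header("Authorization", f"Bearer {gcloud_token}")
    return request


def fetch_access_token(image, gcloud_token):
    """Send the token request; this needs network access and real credentials."""
    with urllib.request.urlopen(build_token_request(image, gcloud_token)) as response:
        return json.load(response)["token"]
```

Building the request separately from sending it makes it easy to inspect the URL and headers before any network call is made.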

Any insights are greatly appreciated.


Solution

  • I did some investigation, and I believe the difference is that the last call, to "https://gcr.io/v2/${image}/blobs/${digest}", is answered with a redirect. Inspecting the curl and httpx calls showed me that neither includes the Authorization header in the second, redirected request, whereas with the way I set up the urllib.request call in the notebook, this header is always included. It is a bit odd that this leads to a 401, but now I know how to address it.

    Edit: I can now confirm that everything works as expected when I build a urllib.request.Request instance and, unlike in the linked notebook, add the Authorization header with the request's add_unredirected_header method.
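
The difference between the two methods can be checked without any network access. The following is a minimal sketch, assuming CPython's standard HTTPRedirectHandler: it asks the handler to build the follow-up request for a 302 response and compares which headers survive. The helper redirected_headers, both URLs, and the token value are hypothetical placeholders.

```python
import urllib.request


def redirected_headers(request, new_url):
    """Return the headers urllib would send to new_url after a 302 redirect."""
    handler = urllib.request.HTTPRedirectHandler()
    # redirect_request only constructs the follow-up Request; nothing is sent.
    follow_up = handler.redirect_request(request, None, 302, "Found", {}, new_url)
    return dict(follow_up.header_items())


# Hypothetical placeholder URLs, not real registry or storage locations.
blob_url = "https://gcr.io/v2/example-project/example-image/blobs/sha256:0000"
storage_url = "https://storage.example.com/blob?token=signed"

# Regular header: copied onto the redirected request.
forwarded = urllib.request.Request(blob_url)
forwarded.add_header("Authorization", "Bearer example-token")

# Unredirected header: sent with the first request only.
not_forwarded = urllib.request.Request(blob_url)
not_forwarded.add_unredirected_header("Authorization", "Bearer example-token")

print(redirected_headers(forwarded, storage_url))
print(redirected_headers(not_forwarded, storage_url))
```

With add_header the Authorization value appears in the first printed dictionary, whereas with add_unredirected_header it is absent from the second, which matches what curl and httpx do by default when a redirect leads to a different host.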