
Download public files from cloud.google.com using python when it requires login


I'm trying to download the twitter misinformation/elections-integrity dataset at: https://storage.cloud.google.com/twitter-election-integrity/hashed/ira/ira_media_file_list_hashed.txt

But it requires a login. I'm not using Google App Engine, just python 3 running on my laptop. I've written the following code to download the files:

import os
import certifi
import pycurl

for a_url in download_urls:
    filename = os.path.join(data_path, os.path.basename(a_url))

    if not os.path.isfile(filename):
        # urllib.request.urlretrieve(a_url, filename)
        with open(filename, 'wb') as f:
            c = pycurl.Curl()
            c.setopt(c.URL, a_url)               # source URL
            c.setopt(c.WRITEDATA, f)             # write the body to the open file
            c.setopt(c.CAINFO, certifi.where())  # CA bundle for TLS verification
            c.perform()
            c.close()

Is there a way I can download these files while avoiding having to log in to my google account?

Or is there an easy way to log in via Python?

Almost all the information online is how to do this from within a GAE environment, and I'm not trying to connect to a bucket.


Solution

  • The URL mentioned indicates that the files are served from Cloud Storage. Since logging in is required, the objects aren't publicly accessible.

    The application serving these files uses a user-centric OAuth 2.0 flow. From Authentication:

    Cloud Storage uses OAuth 2.0 for API authentication and authorization. Authentication is the process of determining the identity of a client.

    • A user-centric flow allows an application to obtain credentials from an end user. The user signs in to complete authentication.

    Is there a way I can download these files while avoiding having to log in to my google account?

    The answer here should be no. Otherwise it's a bug - you'd be able to bypass Google Cloud security ;)

    I couldn't find specifics for pycurl, but curl itself doesn't list OAuth 2.0 as supported. From Features -- what can curl do:

    HTTP

    • authentication: Basic, Digest, NTLM (*9) and Negotiate (SPNEGO) (*3) to server and proxy

    So I think you won't be able to download the files using pycurl alone. At least not directly (maybe via a proxy?).
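    One workaround sketch: since Cloud Storage accepts OAuth 2.0 Bearer tokens on the `storage.googleapis.com` endpoint, you could obtain a short-lived token from the Cloud SDK (after `gcloud auth login`) and attach it as a header yourself. This is an illustration, not a tested recipe; `bearer_request` is a helper name I made up, and it uses the standard library's `urllib.request` rather than pycurl:

    ```python
    import subprocess
    import urllib.request

    def bearer_request(url, token):
        """Build a Request that carries the OAuth token as a Bearer header."""
        return urllib.request.Request(
            url, headers={"Authorization": "Bearer " + token})

    # Assumed flow (requires the Cloud SDK and a prior `gcloud auth login`):
    # token = subprocess.check_output(
    #     ["gcloud", "auth", "print-access-token"], text=True).strip()
    # req = bearer_request(
    #     "https://storage.googleapis.com/twitter-election-integrity/"
    #     "hashed/ira/ira_media_file_list_hashed.txt",
    #     token)
    # data = urllib.request.urlopen(req).read()
    ```

    Note the endpoint change: `storage.cloud.google.com` is the browser-oriented URL (it redirects to a Google sign-in page), while `storage.googleapis.com` is the API endpoint that honors Bearer tokens.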

    One possible alternative would be to use the Cloud SDK's gsutil in your script (launched as any other external process):

    • you'd first obtain an authentication token with gcloud auth login.
    • you'd then launch your script; the gsutil invocations inside it will reuse the previously obtained credentials.
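    The two steps above could be scripted with the standard library's subprocess module. A minimal sketch, assuming gsutil is on your PATH and `gcloud auth login` has already been run; the gs:// path is mapped from the question's storage.cloud.google.com URL:

    ```python
    import shutil
    import subprocess

    def build_gsutil_cp(gs_url, dest_dir):
        """Build the gsutil command that copies one object to a local directory."""
        return ["gsutil", "cp", gs_url, dest_dir]

    # storage.cloud.google.com/<bucket>/<object> corresponds to gs://<bucket>/<object>
    gs_url = ("gs://twitter-election-integrity/hashed/ira/"
              "ira_media_file_list_hashed.txt")
    cmd = build_gsutil_cp(gs_url, ".")

    # Only attempt the copy if gsutil is actually installed; it picks up
    # the credentials obtained earlier with `gcloud auth login`.
    if shutil.which("gsutil"):
        subprocess.run(cmd, check=True)
    ```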

    I see it's possible to install and use gsutil in standalone mode, without the cloud SDK, but I didn't use it this way. Maybe it's worth investigating for your case. From gsutil config:

    The gsutil config command applies to users who have installed gsutil as a standalone tool.

    The gsutil config command obtains access credentials for Google Cloud Storage and writes a boto/gsutil configuration file containing the obtained credentials along with a number of other configuration-controllable values.