I'm trying to download the Twitter misinformation/elections-integrity dataset at: https://storage.cloud.google.com/twitter-election-integrity/hashed/ira/ira_media_file_list_hashed.txt
But it requires a login. I'm not using Google App Engine, just Python 3 running on my laptop. I've written the following code to download the files:
    import os
    import certifi
    import pycurl

    # download_urls and data_path are defined elsewhere in the script
    for a_url in download_urls:
        filename = os.path.join(data_path, os.path.basename(a_url))
        if not os.path.isfile(filename):
            #urllib.request.urlretrieve(a_url, filename)
            with open(filename, 'wb') as f:
                c = pycurl.Curl()
                c.setopt(c.URL, a_url)
                c.setopt(c.WRITEDATA, f)
                c.setopt(c.CAINFO, certifi.where())
                c.perform()
                c.close()
Almost all the information online is about how to do this from within a GAE environment, and I'm not trying to connect to a bucket.
The URL mentioned indicates that the files are served from Cloud Storage. Since logging in is required it means the objects aren't publicly accessible.
The application serving these files uses a user-centric OAuth 2.0 flow. From Authentication:
Cloud Storage uses OAuth 2.0 for API authentication and authorization. Authentication is the process of determining the identity of a client.
- A user-centric flow allows an application to obtain credentials from an end user. The user signs in to complete authentication.
Is there a way I can download these files while avoiding having to log in to my google account?
The answer here should be no. Otherwise it's a bug - you'd be able to bypass Google Cloud security ;)
I couldn't find specifics for pycurl, but curl itself doesn't list OAuth 2.0 as supported. From Features -- what can curl do:
HTTP
- authentication: Basic, Digest, NTLM (*9) and Negotiate (SPNEGO) (*3) to server and proxy
So I think you won't be able to download the files using pycurl. At least not directly (maybe via a proxy?).
One possible alternative would be to use the Cloud SDK's gsutil in your script (launched like any other external process):
- Authenticate once beforehand with gcloud auth login.
- gsutil executions inside the script will then use the previously obtained authentication token.
I see it's possible to install and use gsutil in standalone mode, without the Cloud SDK, but I haven't used it this way. Maybe it's worth investigating for your case. From gsutil config:
The gsutil config command applies to users who have installed gsutil as a standalone tool.
The gsutil config command obtains access credentials for Google Cloud Storage and writes a boto/gsutil configuration file containing the obtained credentials along with a number of other configuration-controllable values.
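As a rough illustration of the gsutil approach described above, here is a minimal sketch in Python. It assumes gsutil is on the PATH and that you have already authenticated (for example with gcloud auth login). It also assumes the browser URLs have the form https://storage.cloud.google.com/<bucket>/<object>, so they can be rewritten as gs://<bucket>/<object>. The helper names to_gs_uri and download_all are made up for this example; download_urls and data_path are the variables from the question.

```python
import os
import subprocess
from urllib.parse import unquote, urlparse


def to_gs_uri(browser_url):
    """Rewrite a storage.cloud.google.com URL as a gs:// URI (assumed URL layout)."""
    parsed = urlparse(browser_url)
    if parsed.netloc != "storage.cloud.google.com":
        raise ValueError("not a Cloud Storage browser URL: " + browser_url)
    # Assumption: the path has the form /<bucket>/<object name>.
    bucket, _, object_name = parsed.path.lstrip("/").partition("/")
    if not bucket or not object_name:
        raise ValueError("URL has no bucket or object name: " + browser_url)
    return "gs://{}/{}".format(bucket, unquote(object_name))


def download_all(download_urls, data_path):
    """Copy each object with `gsutil cp`, skipping files that already exist."""
    for a_url in download_urls:
        filename = os.path.join(data_path, os.path.basename(a_url))
        if not os.path.isfile(filename):
            # gsutil runs as an external process and uses the credentials
            # obtained earlier; check=True raises if the copy fails.
            subprocess.run(["gsutil", "cp", to_gs_uri(a_url), filename], check=True)
```

Calling download_all(download_urls, data_path) would then replace the pycurl loop. If gsutil is installed standalone rather than through the Cloud SDK, the credentials would presumably come from gsutil config instead, but I haven't tried that setup.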