While running Latent Dirichlet Allocation code in Python, the following line raises an error:
docs = fetch_20newsgroups(subset = 'all', remove = ('headers', 'footers', 'quotes'))['data']
The error in Colab is:
HTTPError: HTTP Error 403: Forbidden
What am I doing wrong?
I ran the code in Colab and it fails at this line. I don't know whether this is an access-permission problem, but the dataset can still be downloaded manually. In that case, how do I apply the parameters subset = 'all', remove = ('headers', 'footers', 'quotes') and the ['data'] lookup to the manually downloaded file?
This was working on 13 June 2024 but not today, the 14th. You can still download the file with your browser, but the GET request returns a 403 from Colab.
I took the fetch_20newsgroups logic, hardcoded the path to the file I had downloaded with the browser, and bypassed the call
archive_path = _fetch_remote(
ARCHIVE, dirname=target_dir
)
and it unpacked everything as expected.
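The verify-and-extract part of that manual route can be sketched with the standard library alone. This is a minimal sketch, not scikit-learn's own code: the function names and the example paths are hypothetical, and the only value taken from this thread is the SHA-256 checksum that the ARCHIVE metadata lists for 20news-bydate.tar.gz.

```python
import hashlib
import tarfile
from pathlib import Path

# SHA-256 checksum listed in the ARCHIVE metadata for 20news-bydate.tar.gz
EXPECTED_SHA256 = "8f1b2514ca22a5ade8fbb9cfa5727df95fa587f4c87b786e15c759fa66d95610"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_unpack(archive_path, target_dir, expected_sha256=EXPECTED_SHA256):
    """Check the archive's checksum, then extract it into target_dir."""
    actual = sha256_of(archive_path)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch: expected {expected_sha256}, got {actual}")
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Use the safer "data" extraction filter where this Python provides it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(path=target, filter="data")
        else:
            tar.extractall(path=target)
    return target


# Hypothetical usage; both paths are placeholders for wherever the file was saved:
# verify_and_unpack("/content/20news-bydate.tar.gz", "/content/20news_home")
```

This covers only the checksum check and the extraction. The remaining fetch_20newsgroups logic was left to run unchanged in the bypass described above, so the sketch is not a full replacement for it.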
So if we isolate the code
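# Private scikit-learn helpers, the same ones fetch_20newsgroups uses internally
from sklearn.datasets._base import RemoteFileMetadata, _fetch_remote

ARCHIVE = RemoteFileMetadata(
    filename="20news-bydate.tar.gz",
    url="https://ndownloader.figshare.com/files/5975967",
    checksum="8f1b2514ca22a5ade8fbb9cfa5727df95fa587f4c87b786e15c759fa66d95610",
)
XX = _fetch_remote(ARCHIVE)  # raises the HTTPError shown below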
/usr/lib/python3.10/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
641 class HTTPDefaultErrorHandler(BaseHandler):
642 def http_error_default(self, req, fp, code, msg, hdrs):
--> 643 raise HTTPError(req.full_url, code, msg, hdrs, fp)
644
645 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 403: Forbidden
So we can drill down to
import requests

# `headers` is a dict of request headers defined elsewhere; its contents are not shown here
response = requests.get("https://ndownloader.figshare.com/files/5975967", headers=headers)
print(response.text)
and get the same 403 response:
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>
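To narrow down where the 403 comes from, one option is to compare the status code for the same URL with and without a browser-like User-Agent. This is a diagnostic sketch only: the helper names and the User-Agent string are assumptions, and nothing in this thread confirms that the server filters on User-Agent.

```python
import urllib.error
import urllib.request

# URL taken from the ARCHIVE metadata above
URL = "https://ndownloader.figshare.com/files/5975967"


def build_request(url, user_agent=None):
    """Build a GET request, optionally overriding the default User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def status_for(request, timeout=30):
    """Return the HTTP status code for a request, including error codes."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; report their code instead.
        return err.code


# Hypothetical usage (needs network access); the User-Agent string is an assumption:
# print(status_for(build_request(URL)))
# print(status_for(build_request(URL, user_agent="Mozilla/5.0")))
```

If both requests return 403, the block is probably not based on User-Agent alone. It could, for instance, depend on the client's IP range, which would fit a browser download succeeding while Colab fails, but that is a guess rather than something established here.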