LDA in Python: HTTP 403 error when fetching the 20newsgroups dataset

While running Latent Dirichlet Allocation code in Python, the following line raises an error:

docs = fetch_20newsgroups(subset = 'all',  remove = ('headers', 'footers', 'quotes'))['data']

The error in Colab is: HTTPError: HTTP Error 403: Forbidden

What am I doing wrong?

I tried the code in Colab, but it does not run. I am not sure whether this is an access permission issue. The dataset can be downloaded manually, though. In that case, how do I apply the equivalent of subset = 'all' and remove = ('headers', 'footers', 'quotes'), and get what ['data'] returns?

Solution

This was working on 13th June 2024 but not today, the 14th. You can download the file with your browser, but the GET request returns a 403 from Colab.

I took the fetch_20newsgroups logic, hardcoded the path of the file downloaded with the browser, and bypassed the call

archive_path = _fetch_remote(
        ARCHIVE, dirname=target_dir
    )

and it unpacked everything as expected.
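For readers who prefer not to patch scikit-learn internals, here is a minimal standard-library sketch of reading a manually downloaded archive. It assumes the archive unpacks into one file per post, grouped under train and test folders; the function name load_20news_from_archive and its parameters are hypothetical, and the header stripping only approximates remove=('headers',).

```python
import tarfile
from pathlib import Path


def load_20news_from_archive(archive_path, extract_dir, strip_headers=True):
    """Return the text of every post in a manually downloaded 20news-bydate.tar.gz.

    Hypothetical helper, not part of scikit-learn. extract_dir should be an
    empty (or not yet existing) directory, because every file below it is read.
    """
    extract_dir = Path(extract_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=extract_dir)

    docs = []
    # Reading every extracted file covers both the train and the test folder,
    # which is the rough equivalent of subset='all'.
    for path in sorted(extract_dir.rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="latin1")
        if strip_headers:
            # Approximation of remove=('headers',): keep only what follows
            # the first blank line of the post.
            _header, _blank, text = text.partition("\n\n")
        docs.append(text)
    return docs
```

The returned list plays the role of the ['data'] entry. Unlike fetch_20newsgroups, this sketch does not shuffle the posts and does not strip footers or quoted replies, so the resulting documents will not match scikit-learn's output exactly.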

So if we isolate the download code, the request itself fails:

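# Private scikit-learn helpers; the import path may differ between versions.
from sklearn.datasets._base import RemoteFileMetadata, _fetch_remote

ARCHIVE = RemoteFileMetadata(
    filename="20news-bydate.tar.gz",
    url="https://ndownloader.figshare.com/files/5975967",
    checksum="8f1b2514ca22a5ade8fbb9cfa5727df95fa587f4c87b786e15c759fa66d95610",
)
XX = _fetch_remote(ARCHIVE)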

/usr/lib/python3.10/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    641 class HTTPDefaultErrorHandler(BaseHandler):
    642     def http_error_default(self, req, fp, code, msg, hdrs):
--> 643         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    644 
    645 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

So we can drill down to

import requests

# headers was not defined in the original post; an empty dict keeps requests' defaults.
headers = {}
response = requests.get("https://ndownloader.figshare.com/files/5975967", headers=headers)
print(response.text)

and get the same

<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>
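To narrow this down further, the status code can be checked with the standard library alone, with and without a custom User-Agent. This is only a diagnostic sketch: probe_status and its parameters are hypothetical names, and whether a different User-Agent changes the response from this server is an open question here, not something confirmed.

```python
import urllib.error
import urllib.request


def probe_status(url, user_agent=None, timeout=10):
    """Return the HTTP status code of a GET request to url (hypothetical helper)."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is on the exception.
        return err.code
```

Calling probe_status("https://ndownloader.figshare.com/files/5975967") from Colab and from a local machine, and then again with a browser-like user_agent string, shows whether the 403 depends on where the request comes from or on how it identifies itself.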