LDA in Python: HTTP 403 error when fetching the 20newsgroups dataset


While running Latent Dirichlet Allocation code in Python, the following line raises an error:

from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

The error in Colab is: HTTPError: HTTP Error 403: Forbidden

What am I doing wrong?

I ran the code in Colab and it fails there. I do not know whether this is an access-permission problem, but the dataset can be downloaded manually. In that case, how do I apply the parameters subset='all' and remove=('headers', 'footers', 'quotes'), and the ['data'] lookup?


Solution

  • This was working on 13 June 2024 but not today, 14 June. You can still download the file in a browser, but the GET request from Colab returns a 403.

    I took the fetch_20newsgroups logic, hardcoded the path to the file downloaded in the browser, and bypassed this call:

    archive_path = _fetch_remote(ARCHIVE, dirname=target_dir)
    

    and it unpacked everything as expected.
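
A rough way to do something similar without patching scikit-learn is to read the browser-downloaded archive directly. This is a minimal sketch, not the code used above. It assumes the file was uploaded to Colab as 20news-bydate.tar.gz and has the usual layout (20news-bydate-train/<category>/<id> and 20news-bydate-test/<category>/<id>). The names load_20news_from_archive and strip_header are hypothetical. The subset argument mirrors fetch_20newsgroups, and dropping everything before the first blank line is only a stand-in for remove=('headers',); footers and quotes are not stripped here.

```python
import tarfile

# Folder names inside 20news-bydate.tar.gz (assumed layout).
FOLDERS = {"train": "20news-bydate-train/", "test": "20news-bydate-test/"}


def strip_header(text):
    """Drop everything up to and including the first blank line."""
    _head, _sep, body = text.partition("\n\n")
    return body


def load_20news_from_archive(archive_path, subset="all", remove_headers=True):
    """Return (docs, labels) read from a manually downloaded archive."""
    if subset == "all":
        prefixes = tuple(FOLDERS.values())
    else:
        prefixes = (FOLDERS[subset],)
    docs, labels = [], []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            name = member.name
            if name.startswith("./"):
                name = name[2:]
            if not member.isfile() or not name.startswith(prefixes):
                continue
            # Paths look like <folder>/<category>/<doc id>.
            parts = name.split("/")
            if len(parts) != 3:
                continue
            # The posts are not valid UTF-8 throughout; latin1 never fails.
            text = tar.extractfile(member).read().decode("latin1")
            docs.append(strip_header(text) if remove_headers else text)
            labels.append(parts[1])
    return docs, labels
```

Calling docs, labels = load_20news_from_archive("20news-bydate.tar.gz", subset="all") then gives a list of strings that can stand in for the ['data'] entry. Documents come back in archive order, so sort them if a reproducible order matters.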

    So if we isolate the download step:

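    # Private scikit-learn helpers, imported the way fetch_20newsgroups does.
    from sklearn.datasets._base import RemoteFileMetadata, _fetch_remote

    ARCHIVE = RemoteFileMetadata(
        filename="20news-bydate.tar.gz",
        url="https://ndownloader.figshare.com/files/5975967",
        checksum="8f1b2514ca22a5ade8fbb9cfa5727df95fa587f4c87b786e15c759fa66d95610",
    )
    XX = _fetch_remote(ARCHIVE)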
    
    /usr/lib/python3.10/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
        641 class HTTPDefaultErrorHandler(BaseHandler):
        642     def http_error_default(self, req, fp, code, msg, hdrs):
    --> 643         raise HTTPError(req.full_url, code, msg, hdrs, fp)
        644 
        645 class HTTPRedirectHandler(BaseHandler):
    
    HTTPError: HTTP Error 403: Forbidden
    

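    So we can drill down to a plain GET request:

    import requests

    # Assumption: the original post does not show how `headers` was defined;
    # a browser-like User-Agent is used here as a placeholder.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(
        "https://ndownloader.figshare.com/files/5975967", headers=headers
    )
    print(response.text)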
    

    and get the same 403 response:

    <html>
    <head><title>403 Forbidden</title></head>
    <body>
    <center><h1>403 Forbidden</h1></center>
    </body>
    </html>
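
To narrow down whether the 403 depends on the request headers or on where the request comes from, it can help to compare status codes rather than page bodies. The following is a minimal sketch using only the standard library. The helper name probe_status is hypothetical, and the idea that a browser-like User-Agent changes the outcome is an untested assumption, not something established above.

```python
import urllib.error
import urllib.request

URL = "https://ndownloader.figshare.com/files/5975967"


def probe_status(url, headers=None, timeout=10):
    """Return the HTTP status code for url without raising on 4xx/5xx."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        # The response is closed right away, so the body is not read here.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on error statuses; the code is what we want.
        return err.code
```

Running probe_status(URL) and probe_status(URL, {"User-Agent": "Mozilla/5.0"}) both in Colab and on a local machine gives four status codes to compare. If only the Colab calls return 403, the block is more likely tied to the host than to the headers.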
    
