Tags: python, http, python-requests, http-status-code-404, http-status-codes

How do I make Python go through the URLs in a text file, check their status codes, and exclude the ones that return a 404 error?


I tried the following script, but unfortunately the output file is identical to the input file. I'm not sure what's wrong with it.

import requests

url_lines = open('banana1.txt').read().splitlines()

remove_from_urls = []

for url in url_lines:
    remove_url = requests.get(url)
    print(remove_url.status_code)
    if remove_url.status_code == 404:
        remove_from_urls.append(url)
        continue
        
url_lines = [url for url in url_lines if url not in remove_from_urls]
print(url_lines)

# Save urls example
with open('banana2.txt', 'w+') as file:
    for item in url_lines:
        file.write(item + '\n')

Solution

  • There seems to be no error in your code, but there are a few things that would help make it more readable and consistent. The first course of action should be to make sure there is at least one URL in your list that actually returns a 404 status code.
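
    For example, you can verify that the script reacts to a real 404 by feeding it a URL that is guaranteed to return one. Here is a quick sanity check (it uses the httpbin.org testing service, which is my suggestion and not part of your original list):

    import requests

    # httpbin.org/status/<code> replies with the status code you ask for,
    # so it is a convenient way to produce a guaranteed 404 response.
    print(requests.get("https://httpbin.org/status/404").status_code)  # 404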

    Edit: after the actual URL was provided.

    The 404 problem

    In your case, the problem is that Twitter does not actually return a 404 error for your "not found" URL. You can test this using curl:

    $ curl -o /dev/null -w "%{http_code}" "https://twitter.com/davemeltzerWON/status/1321279214365016064"
    200
    

    Or using Python:

    import requests
    response = requests.get("https://twitter.com/davemeltzerWON/status/1321279214365016064")
    print(response.status_code)
    

    The output for both should be 200.

    Since Twitter is a JavaScript application that loads its content after the page has been processed in the browser, you cannot find the information you are looking for in the raw HTML response. You would need something like Selenium to process the JavaScript for you; then you could look for the actual text, such as "not found", on the rendered page.
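
    A minimal sketch of that approach (assuming Chrome is available, and assuming the rendered page contains the literal text "not found" for a deleted tweet; both are assumptions, not guarantees):

    import time
    from selenium import webdriver

    url = "https://twitter.com/davemeltzerWON/status/1321279214365016064"

    driver = webdriver.Chrome()  # assumes a Chrome driver is installed
    try:
        driver.get(url)
        # Crude wait for the JavaScript to render; a WebDriverWait on a
        # specific element would be more robust.
        time.sleep(5)
        # The exact error wording Twitter renders is an assumption here.
        if "not found" in driver.page_source.lower():
            print("Tweet does not exist")
    finally:
        driver.quit()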

    Code review

    Please make sure to close the file properly. Also, a file object is an iterator over its lines, so you can convert it to a collection very easily. Another trick to make the code more readable is to use a Python set. You may read the file like this (stripping the trailing newline from each line, so the URLs can be joined and written back later without doubled line breaks):

    with open("banana1.txt") as fid:
        url_lines = {line.strip() for line in fid}
    

    Then you simply remove all the links that do not work:

    not_working = set()
    for url in url_lines:
        # collect every URL that responds with 404
        if requests.get(url).status_code == 404:
            not_working.add(url)

    # set difference keeps only the URLs that did not return 404
    working = url_lines - not_working

    with open("banana2.txt", "w") as fid:
        fid.write("\n".join(working))
    

    Also, if some of the links point to the same server, you should make use of the requests.Session class:

    from requests import Session
    session = Session()
    

    Then replace requests.get with session.get; you should get a performance boost, since a Session reuses keep-alive connections, among other optimizations.
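
    For example, the filtering loop from above could be rewritten like this (same logic, only the transport changes):

    from requests import Session

    session = Session()

    not_working = set()
    for url in url_lines:
        # session.get reuses the keep-alive connection when consecutive
        # URLs point to the same host, instead of reconnecting each time
        if session.get(url).status_code == 404:
            not_working.add(url)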