Tags: python, http, python-requests, http-status-code-404, http-status-codes

How do I make Python go through the URLs in a text file, check their status codes, and exclude the ones that return a 404 error?


I tried the following script, but unfortunately the output file is identical to the input file. I'm not sure what's wrong with it.

import requests

url_lines = open('banana1.txt').read().splitlines()

remove_from_urls = []

for url in url_lines:
    remove_url = requests.get(url)
    print(remove_url.status_code)
    if remove_url.status_code == 404:
        remove_from_urls.append(url)
        continue
        
url_lines = [url for url in url_lines if url not in remove_from_urls]
print(url_lines)

# Save urls example
with open('banana2.txt', 'w+') as file:
    for item in url_lines:
        file.write(item + '\n')

Solution

  • There seems to be no error in your code, but there are a few things that would help make it more readable and consistent. The first course of action should be to make sure there is at least one URL in your list that actually returns a 404 status code.
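
    For example, you can verify that the script reacts to a real 404 by feeding it a URL that is guaranteed to return one. Here is a quick sanity check (it uses the httpbin.org testing service, which is my suggestion and not part of your original list):

    import requests

    # httpbin.org/status/<code> replies with the status code you ask for,
    # so it is a convenient way to produce a guaranteed 404 response.
    print(requests.get("https://httpbin.org/status/404").status_code)  # 404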

    Edit: after the actual URL was provided.

    The 404 problem

    In your case, the problem is that Twitter does not actually return a 404 error for your "not found" URL. You can test this using curl:

    $ curl -o /dev/null -w "%{http_code}" "https://twitter.com/davemeltzerWON/status/1321279214365016064"
    200
    

    Or using Python:

    import requests
    response = requests.get("https://twitter.com/davemeltzerWON/status/1321279214365016064")
    print(response.status_code)
    

    The output for both should be 200.

    Since Twitter is a JavaScript application that loads its content after the page has been processed in the browser, you cannot find the information you are looking for in the raw HTML response. You would need something like Selenium to process the JavaScript for you; then you could look for the actual text, such as "not found", on the rendered page.
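
    A minimal sketch of that approach (assuming Chrome is available, and assuming the rendered page contains the literal text "not found" for a deleted tweet; both are assumptions, not guarantees):

    import time
    from selenium import webdriver

    url = "https://twitter.com/davemeltzerWON/status/1321279214365016064"

    driver = webdriver.Chrome()  # assumes a Chrome driver is installed
    try:
        driver.get(url)
        # Crude wait for the JavaScript to render; a WebDriverWait on a
        # specific element would be more robust.
        time.sleep(5)
        # The exact error wording Twitter renders is an assumption here.
        if "not found" in driver.page_source.lower():
            print("Tweet does not exist")
    finally:
        driver.quit()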

    Code review

    Please make sure to close the file properly. Also, a file object is an iterator over its lines, so you can convert it to a collection very easily. Another trick to make the code more readable is to use a Python set. You may read the file like this (stripping the trailing newline from each line, so the URLs can be joined and written back later without doubled line breaks):

    with open("banana1.txt") as fid:
        url_lines = {line.strip() for line in fid}
    

    Then you simply remove all the links that do not work:

    not_working = set()
    for url in url_lines:
        # collect every URL that responds with 404
        if requests.get(url).status_code == 404:
            not_working.add(url)

    # set difference keeps only the URLs that did not return 404
    working = url_lines - not_working

    with open("banana2.txt", "w") as fid:
        fid.write("\n".join(working))
    

    Also, if some of the links point to the same server, you should make use of the requests.Session class:

    from requests import Session
    session = Session()
    

    Then replace requests.get with session.get; you should get a performance boost, since a Session reuses keep-alive connections, among other optimizations.
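
    For example, the filtering loop from above could be rewritten like this (same logic, only the transport changes):

    from requests import Session

    session = Session()

    not_working = set()
    for url in url_lines:
        # session.get reuses the keep-alive connection when consecutive
        # URLs point to the same host, instead of reconnecting each time
        if session.get(url).status_code == 404:
            not_working.add(url)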