Tags: python, urllib2, erase

How to delete a line from a file after it has been used


I'm trying to write a script that makes requests to the URLs listed in a text file, for example:

import urllib2

with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        if r.code in (200, 401):
            print '[{}]: '.format(url), "Up!"

When a URL returns 404 Not Found, I want the line containing that URL to be erased from the file. There is one unique URL per line, so the goal is to remove every URL (and its line) that returns 404. How can I accomplish this?


Solution

  • The easiest way is to read all the lines, try to open each URL, and then, once you are done, rewrite the file if any URLs failed.

    The way to rewrite the file is to write a new file and, once it has been successfully written and closed, use os.rename() to give it the name of the old file, replacing the old file. This is the safe way to do it: you never overwrite the good file until you know the new file is correctly written. (On Windows, os.rename() fails if the destination already exists, so the old file has to be removed first.)
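
    The write-then-rename idea can be sketched on its own as follows. This is a minimal sketch written for Python 3, not the Python 2 used in this answer: `rewrite_file` is a hypothetical helper name, and it assumes `os.replace()` (Python 3.3+), which overwrites an existing destination on both POSIX and Windows.

```python
import os
import tempfile


def rewrite_file(path, lines):
    """Replace the contents of `path` with `lines`, going through a temp file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
        # Only now touch the original: the new file is fully written and closed.
        os.replace(tmp_path, path)
    except Exception:
        # Clean up the temp file and leave the original untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

    One caveat: `tempfile.mkstemp()` creates the file readable and writable only by the current user, so the rewritten file may end up with different permissions than the original.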

    I think the simplest way to do this is to collect the good URLs in a list and keep a count of failed URLs; if the count is not zero, you need to rewrite the text file. Alternatively, you can collect the bad URLs in a second list, which is what this example code does. (I haven't tested this code but I think it should work.)

    import os
    import urllib2
    
    input_file = "urls.txt"
    debug = True
    
    good_urls = []
    bad_urls = []
    
    bad, good = range(2)
    
    def track(url, good_flag, code):
        if good_flag == good:
            good_str = "good"
        elif good_flag == bad:
            good_str = "bad"
        else:
            # report the unexpected flag value that was passed in
            good_str = "ERROR! (" + repr(good_flag) + ")"
        if debug:
            print("DEBUG: %s: '%s' code %s" % (good_str, url, repr(code)))
        if good_flag == good:
            good_urls.append(url)
        else:
            bad_urls.append(url)
    
    with open(input_file) as f:
        for line in f:
            url = line.strip()
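            try:
                code = urllib2.urlopen(url).code
            except urllib2.HTTPError as e:
                # urlopen raises HTTPError for 4xx/5xx responses, so 401 and 404 arrive here
                code = e.code
            except urllib2.URLError as e:
                # no HTTP response at all (DNS failure, refused connection, ...)
                track(url, bad, "exception!")
                continue
            if code in (200, 401):
                print '[{0}]: '.format(url), "Up!"
            if code == 404:
                # URL is bad if it is missing (code 404)
                track(url, bad, code)
            else:
                # any code other than 404, assume URL is good
                track(url, good, code)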
    
    # if any URLs were bad, rewrite the input file to remove them.
    if bad_urls:
        # simple way to get a filename for temp file: append ".tmp" to filename
        temp_file = input_file + ".tmp"
        with open(temp_file, "w") as f:
            for url in good_urls:
                f.write(url + '\n')
        # if we reach this point, temp file is good.  Remove old input file
        os.remove(input_file)  # only needed for Windows
        os.rename(temp_file, input_file)  # replace original input file with temp file
    

    EDIT: In the comments, @abarnert suggests that using os.rename() might be a problem on Windows (at least I think that is what he/she means). If os.rename() doesn't work, you should be able to use shutil.move() instead.
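
    A sketch of that fallback, assuming the temp file has already been written and closed. The helper name `replace_with` is hypothetical; it tries os.rename() first and only falls back to shutil.move() if the rename raises OSError.

```python
import os
import shutil


def replace_with(temp_file, input_file):
    """Put temp_file in place of input_file, falling back to shutil.move()."""
    try:
        os.rename(temp_file, input_file)
    except OSError:
        # For example on Windows, where os.rename() refuses to overwrite
        # an existing file: remove the old file, then move the new one in.
        if os.path.exists(input_file):
            os.remove(input_file)
        shutil.move(temp_file, input_file)
```

    Unlike a single rename on POSIX, the fallback path is not atomic: there is a brief moment when the old file is gone and the new one is not yet in place.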

    EDIT: Rewrite code to handle errors.

    EDIT: Rewrite to add verbose messages as URLs are tracked. This should help with debugging. Also, I actually tested this version and it works for me.
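
    For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the checking step might look roughly like this. It is a sketch under assumptions, not a port of the code above: `status_of` and `split_urls` are hypothetical names, the `opener` parameter exists only so the function can be exercised without a network, and URLs that fail without any HTTP response are kept here rather than removed, since the question asks only about 404.

```python
import urllib.error
import urllib.request


def status_of(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        response = opener(url)
    except urllib.error.HTTPError as e:
        # 4xx/5xx responses are raised as HTTPError; the code is on the exception.
        return e.code
    except urllib.error.URLError:
        # DNS failure, refused connection, and similar: no status code at all.
        return None
    try:
        return response.getcode()
    finally:
        response.close()


def split_urls(urls, opener=urllib.request.urlopen):
    """Partition urls into (kept, removed); only a 404 removes a URL."""
    kept, removed = [], []
    for url in urls:
        if status_of(url, opener) == 404:
            removed.append(url)
        else:
            kept.append(url)
    return kept, removed
```

    The `kept` list is what would be written back to the file, and the rewrite can be skipped entirely when `removed` is empty.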