Tags: python, python-3.x, url, normalization, normalize

Cleaning URLs and saving them to a txt file in Python 3


I am trying to clean and normalize URLs in a text file.

Here is my current code:

import re

with open("urls.txt", encoding='utf-8') as f:
    content = f.readlines()
content = [x.strip() for x in content]

url_format = "https://www.google"
for item in content:
    if not item.startswith(url_format):
        old_item = item
        new_item = re.sub(r'.*google', url_format, item)
        content.append(new_item)
        content.remove(old_item)

with open('result.txt', mode='wt', encoding='utf-8') as myfile:
    myfile.write('\n'.join(content))

The issue is that if I print the old and new items inside the loop, it shows that each URL has been cleaned. But when I print my list of URLs outside of the loop, the URLs are still not cleaned: some of the bad ones get removed and some do not.

May I ask why the bad URLs are still in the list when I remove them in my for loop and append the cleaned URLs? Perhaps this should be resolved in a different way?

Also, I have noticed that with a big set of URLs the code takes a long time to run. Perhaps I should use different tools?

Any help will be appreciated.


Solution

  • That is because you are removing items from the list while iterating over it, which is a bad thing to do (a short demonstration of this follows the examples below). You could either create another list and append the new values to it, modify the list in place using indexing, or just use a list comprehension for this task:

    content = [item if item.startswith(url_format) else re.sub(r'.*google', url_format, item) for item in content]
    

    Or, using another list:

    new_content = []
    
    for item in content:
        if item.startswith(url_format):
            new_content.append(item)
        else:
            new_content.append(re.sub(r'.*google', url_format, item))
    
    content = new_content  # continue with the cleaned list
    

    Or, modifying the list in-place, using indexing:

    for i, item in enumerate(content):
        if not item.startswith(url_format):
            content[i] = re.sub(r'.*google', url_format, item)
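
    To see why some of the original URLs never get cleaned, here is a small standalone demonstration (not taken from your code) of what happens when you remove elements from a list while looping over it. The loop tracks a position internally, and removing an element shifts everything after it one slot to the left, so the next element is skipped:

    letters = ['a', 'b', 'c', 'd']
    for item in letters:
        # removing the current item shifts the rest of the list left,
        # so the element right after it is never visited
        letters.remove(item)
    print(letters)  # ['b', 'd'] ('b' and 'd' were skipped, not removed)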
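
    Putting it together with your file handling, a complete version of the script could look something like this minimal sketch, using the list comprehension and assuming the same urls.txt / result.txt filenames:

    import re

    url_format = "https://www.google"

    # read and strip the URLs
    with open("urls.txt", encoding='utf-8') as f:
        content = [line.strip() for line in f]

    # rewrite everything up to "google" with the desired prefix,
    # leaving already-clean URLs untouched
    content = [item if item.startswith(url_format)
               else re.sub(r'.*google', url_format, item)
               for item in content]

    with open('result.txt', mode='wt', encoding='utf-8') as myfile:
        myfile.write('\n'.join(content))

    This should also be noticeably faster on a large file: list.remove has to search the list from the start each time it is called, so your original loop does an extra linear scan per URL, while a single pass over the list does not.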