python python-3.x url normalization normalize

Python 3 clean and normalize URL list

I have a list of URLs in a text file, and I need to run a Python 3 function over them so that every URL matches the format https://www.google.com/images/

An example of the list:

http://www.google.com/images/<text>
https://ca.google.com/images/<text>
https://www.google.com/images/<text>
http://uk.google.com/images/<text>
https://www.google.com/images/<text>

I need a script that reads through the file and cleans each URL. For example, http://www.google.com/images/ should change to https://www.google.com/images/, and a country-code subdomain should be replaced with www. So http://ca.google.com should become https://www.google.com

What tools should I use to detect incorrect URLs so that I can locate them, fix them, and save the result back to the file?

Any help will be appreciated, thank you!

Current code:

urls = open("urls.txt", "r", encoding='utf-8')
urls = [item.replace('http://', 'https://') for item in urls]
for item in urls:
    if not 'www' in item:
        old_item = item
        v = str(item[8:10])
        new_item = item.replace(v, 'www')
        urls.append(new_item)
        urls.remove(old_item)
print(urls)

Solution

Strings are immutable in Python, so we cannot change characters inside them; every change produces a new string, hence the slight complication. First we replace http:// with https://. Then we check whether www is present in the link. If it is not, we replace the country code (two letters right after the scheme) with www.

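list1 = ['http://www.google.com/images', 'https://ca.google.com/images', 'https://www.google.com/images',
         'http://uk.google.com/images', 'https://www.google.com/images']

# Force every URL onto https (only the leading scheme is replaced).
list1 = [item.replace('http://', 'https://', 1) for item in list1]

prefix = 'https://'
cleaned = []
for item in list1:
    # Build a new list rather than appending to and removing from list1
    # while looping over it, which can skip elements and reorders the list.
    if not item.startswith(prefix + 'www.'):
        # Assumes a two-letter country code directly after 'https://';
        # slicing avoids replacing the same two letters elsewhere in the URL.
        item = prefix + 'www' + item[len(prefix) + 2:]
    cleaned.append(item)
list1 = cleaned

print(list1)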

Output: ['https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images']
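
On the "which tools" part of the question: the standard library's urllib.parse can split a URL into scheme, host and path, which makes it easier to spot entries that do not fit the pattern instead of slicing by character position. Below is a minimal sketch, not a definitive implementation. The function names normalize_url, normalize_lines and clean_file are made up for illustration, and the rule "host must be www or a two-letter label in front of google.com" is an assumption based on the examples in the question.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url, domain="google.com"):
    """Return url rewritten as https://www.<domain>/..., or None if it does not fit."""
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname or ""
    except ValueError:
        # urlsplit raises ValueError for some malformed hosts (e.g. bad IPv6 brackets).
        return None
    if parts.scheme not in ("http", "https"):
        return None
    # Assumption: exactly one label in front of the domain, either 'www'
    # or a two-letter country code such as 'ca' or 'uk'.
    label, dot, rest = host.partition(".")
    if not dot or rest != domain:
        return None
    if label != "www" and not (len(label) == 2 and label.isalpha()):
        return None
    # Any port or user info in the original URL is dropped here.
    return urlunsplit(("https", "www." + domain, parts.path, parts.query, parts.fragment))


def normalize_lines(lines, domain="google.com"):
    """Split lines into (normalized URLs, lines that could not be normalized)."""
    good, bad = [], []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        fixed = normalize_url(line, domain)
        if fixed is None:
            bad.append(line.strip())
        else:
            good.append(fixed)
    return good, bad


def clean_file(path, domain="google.com"):
    """Rewrite the file at path with normalized URLs; return the rejected lines."""
    with open(path, encoding="utf-8") as f:
        good, bad = normalize_lines(f, domain)
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(url + "\n" for url in good)
    return bad
```

Calling clean_file("urls.txt") would overwrite the file with the cleaned URLs and hand back the lines it could not recognize, so those can be inspected and fixed by hand. Writing to a separate output file first is probably safer while testing.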