I have a list of URLs in a text file, and I need to run a function in Python 3 so that every URL matches the format https://www.google.com/images/
An example of the list:
http://www.google.com/images/<text>
https://ca.google.com/images/<text>
https://www.google.com/images/<text>
http://uk.google.com/images/<text>
https://www.google.com/images/<text>
I need a script that reads through the file and cleans each URL. For example, http://www.google.com/images/ should change to https://www.google.com/images/, and the country code should be replaced with www as well. So http://ca.google.com should change to https://www.google.com.
What tools should I use to detect incorrect URLs so I can locate them, fix them, and save the result back to the file?
Any help will be appreciated, thank you!
Current code:
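# Read the URLs, dropping trailing newlines and blank lines
with open("urls.txt", "r", encoding='utf-8') as f:
    urls = [line.strip() for line in f if line.strip()]
urls = [item.replace('http://', 'https://') for item in urls]
cleaned = []
for item in urls:
    if 'www' not in item:
        # item[8:10] is the two-letter country code right after "https://"
        item = item[:8] + 'www' + item[10:]
    cleaned.append(item)
print(cleaned)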
Strings are immutable in Python, so we can't change characters in place; we have to build new strings, hence the slight complication. First we replace http:// with https://. Then we check whether www is present in the link. If it isn't, we replace the two-letter country code with www.
list1 = ['http://www.google.com/images', 'https://ca.google.com/images', 'https://www.google.com/images', 'http://uk.google.com/images',
         'https://www.google.com/images']
# Force every URL onto https
list1 = [item.replace('http://', 'https://') for item in list1]
cleaned = []
for item in list1:
    if 'www' not in item:
        # item[8:10] is the two-letter country code right after "https://";
        # slicing avoids touching the same two letters elsewhere in the URL
        item = item[:8] + 'www' + item[10:]
    cleaned.append(item)
list1 = cleaned
print(list1)
Output:
['https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images']
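If the file may also contain lines that are not google.com URLs at all, a parser-based approach is probably safer than fixed string offsets. The following is only a sketch, not the method used above: it relies on the standard library's urllib.parse, and the names normalize_url, clean_file and TARGET_HOST are hypothetical helpers made up for this example. It assumes that any subdomain of google.com should be rewritten to www.google.com and that anything else counts as "incorrect".

```python
from urllib.parse import urlsplit, urlunsplit

TARGET_HOST = "www.google.com"  # assumed from the target format in the question


def normalize_url(url):
    """Return url as https://www.google.com/..., or None if it is not a google.com URL."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs (e.g. a bad IPv6 host)
        return None
    if parts.scheme not in ("http", "https"):
        return None
    host = (parts.hostname or "").lower()
    # Accept google.com itself or any subdomain of it (www, ca, uk, ...)
    if host != "google.com" and not host.endswith(".google.com"):
        return None
    return urlunsplit(("https", TARGET_HOST, parts.path, parts.query, parts.fragment))


def clean_file(path):
    """Rewrite the file with normalized URLs; return the lines that were rejected."""
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    cleaned, rejected = [], []
    for line in lines:
        fixed = normalize_url(line)
        if fixed is None:
            rejected.append(line)
        else:
            cleaned.append(fixed)
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(u + "\n" for u in cleaned)
    return rejected
```

Calling clean_file("urls.txt") would overwrite the file with the cleaned URLs and hand back the rejected lines, so you can inspect or fix those by hand. Note that this sketch drops rejected lines from the file; keep a backup if that is not what you want.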