I have saved a website's HTML code in a .txt
file on my computer. I would like to extract all URLs from this text file using the following code:
def get_net_target(page):
    start_link = page.find("href=")
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url

my_file = open("test12.txt")
page = my_file.read()
print(get_net_target(page))
However, the script only prints the first URL, but not all other links. Why is this?
You need a loop to go through all the URLs. print(get_net_target(page)) only prints the first URL found in page, so you will need to call the function again and again, each time replacing page with the substring page[end_quote+1:], until no more URLs are found.
To get you started, next_index will store the end position of the last URL found, and the loop will then retrieve the following URLs:
next_index = 0  # the position in the page from which the next URL search starts

def get_net_target(page):
    global next_index
    start_link = page.find("href=")
    if start_link == -1:  # no more URLs
        return ""
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    next_index = end_quote + 1
    url = page[start_quote + 1:end_quote]
    return url
my_file = open("test12.txt")
page = my_file.read()

while True:
    url = get_net_target(page)
    if url == "":  # no more URLs
        break
    print(url)
    page = page[next_index:]  # continue with the rest of the page
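If you would rather avoid the global variable, one possible variant (a sketch, not tested against your file; the sample HTML string is made up for illustration) is to have the function return both the URL and the position where the next search should resume:

```python
def get_net_target(page, start=0):
    """Return (url, resume_position), or ("", -1) when no URL is left."""
    start_link = page.find("href=", start)
    if start_link == -1:  # no more URLs
        return "", -1
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote + 1

page = '<a href="http://a.example">one</a> <a href="http://b.example">two</a>'
pos = 0
while True:
    url, pos = get_net_target(page, pos)
    if pos == -1:
        break
    print(url)
```

This keeps the whole page string intact and only moves the search index forward, which is slightly cheaper than repeatedly slicing the page.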
Also be careful: this only retrieves links whose href value is enclosed in double quotes ("), but in HTML the value can also be enclosed in single quotes (') or not quoted at all...
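To handle all three quoting styles without writing your own parsing logic, you could use Python's standard-library html.parser, which normalizes attribute values for you. A minimal sketch (the sample HTML string is made up for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the value of every href attribute seen in the document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already stripped
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)

parser = LinkCollector()
parser.feed('<a href=foo>x</a> <a href=\'bar\'>y</a> <a href="baz">z</a>')
print(parser.links)  # ['foo', 'bar', 'baz']
```

Unlike the find-based approach, this also ignores text that merely looks like href= inside comments or attribute values.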