Tags: python, html, url, html-content-extraction

How to properly extract URLs from HTML code?


I have saved a website's HTML code in a .txt file on my computer. I would like to extract all URLs from this text file using the following code:

def get_net_target(page):
    start_link=page.find("href=")
    start_quote=page.find('"',start_link)
    end_quote=page.find('"',start_quote+1)
    url=page[start_quote+1:end_quote]
    return url
my_file = open("test12.txt")
page = my_file.read()
print(get_net_target(page))

However, the script prints only the first URL and none of the other links. Why is this?


Solution

  • You need a loop to go through all the URLs.

    print(get_net_target(page)) prints only the first URL found in page, so you need to call the function repeatedly, each time replacing page with the substring page[end_quote+1:], until no more URLs are found.

    To get you started, next_index stores the position where the last URL found ends, and the loop then retrieves the following URLs:

    next_index = 0  # position in page from which the next URL search starts

    def get_net_target(page):
        """Return the first double-quoted href value in page, or None if there is none."""
        global next_index

        start_link = page.find("href=")
        if start_link == -1:  # no more URLs
            return None
        start_quote = page.find('"', start_link)
        end_quote = page.find('"', start_quote + 1)
        if start_quote == -1 or end_quote == -1:  # no complete quoted value left
            return None
        next_index = end_quote + 1  # resume just after the closing quote
        return page[start_quote + 1:end_quote]


    with open("test12.txt") as my_file:
        page = my_file.read()

    while True:
        url = get_net_target(page)
        if url is None:  # no more URLs
            break
        print(url)
        page = page[next_index:]  # continue with the rest of the page
    

    Also be careful: this code retrieves only links enclosed in double quotes ("), but href values can also be enclosed in single quotes (') or not quoted at all.
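
One way to cope with the quoting caveat is to let an HTML parser read the attributes instead of searching for quote characters by hand. The following is a minimal sketch using html.parser from Python's standard library. It is a different technique from the string-search loop in the answer, and the names HrefCollector, extract_urls and sample are made up for this illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every start tag that has one (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name == "href" and value is not None:
                self.urls.append(value)


def extract_urls(page):
    """Return all href values in page, in document order."""
    parser = HrefCollector()
    parser.feed(page)
    parser.close()
    return parser.urls


# Hypothetical sample covering double-quoted, single-quoted and unquoted values
sample = '<a href="https://example.com/a">A</a> <a href=\'/b\'>B</a> <a href=/c>C</a>'
print(extract_urls(sample))  # ['https://example.com/a', '/b', '/c']
```

To use it on the saved page, read test12.txt into a string as in the question and pass that string to extract_urls. Note that this collects href from any tag (a, link, and so on), which matches what the string search does; filter on tag inside handle_starttag if only anchors are wanted.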