
How to open "partial" links using Python?


I'm working on a web scraper that opens a webpage and prints any link within that page if the link contains a keyword (I will later open these links for further scraping).

For example, I am using the requests module to open "cnn.com", and then trying to parse out all href/links within that webpage. Then, if any of the links contain a specific word (such as "china"), Python should print that link.

I could simply open the main page using requests, save all hrefs in a list ('links'), and then use:

links = [...]

keyword = "china"

for link in links:
    if keyword in link:
        print(link)

However, the problem with this method is that the links I originally parsed out aren't always full links. For example, all links on CNBC's webpage are structured like this:

href="https://www.cnbc.com/2019/08/11/how-recession-affects-tech-industry.html"

But on CNN's page, they're written like this (not full links: they're missing the scheme and domain that come before the first "/"):

href="/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"

This is a problem because I'm writing more code to automatically open these links and parse them. But Python can't open

"/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"

because it isn't a full link.

So, what is a robust solution to this (something that works for other sites too, not just CNN)?

EDIT: I know the links I wrote as examples in this post don't contain the word "China", but these are just examples.


Solution

  • Try using the urljoin function from the urllib.parse module. It takes two parameters: the first is the URL of the page you're currently parsing, which serves as the base for relative links, and the second is the link you found. If the link you found is already absolute (for example, it starts with http:// or https://), urljoin returns that link unchanged; otherwise it resolves the link relative to the URL you passed as the first parameter.

    So for example:

    #!/usr/bin/env python3
    
    from urllib.parse import urljoin
    
    # A relative link is resolved against the base URL of the page it came from.
    print(
      urljoin(
        "https://www.cnn.com/",
        "/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
      )
    )
    # prints "https://www.cnn.com/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
    
    # An absolute link is returned unchanged, whatever the base is.
    print(
      urljoin(
        "https://www.cnn.com/",
        "http://some-other.website/"
      )
    )
    # prints "http://some-other.website/"
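
To tie this back to the keyword filter in the question, here is a minimal sketch of the whole flow: collect every href, resolve it with urljoin, and keep the links that contain the keyword. It uses only the standard library, so the HTML is a made-up sample string rather than a page downloaded with requests. The names LinkCollector, find_links and SAMPLE_HTML are hypothetical, and the case-insensitive match and the http/https filter are assumptions about what you want, not requirements.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the raw href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_links(html, page_url, keyword):
    """Return absolute http(s) links found in html that contain keyword."""
    collector = LinkCollector()
    collector.feed(html)

    matches = []
    for href in collector.hrefs:
        # Resolve relative links against the URL of the page they came from.
        absolute = urljoin(page_url, href)
        # Assumption: only web links are wanted, so skip mailto:, javascript:, etc.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        # Assumption: the keyword match should ignore case.
        if keyword.lower() in absolute.lower():
            matches.append(absolute)
    return matches


# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<a href="/2019/08/10/asia/china-example/index.html">relative</a>
<a href="https://www.cnbc.com/2019/08/11/china-example.html">absolute</a>
<a href="/2019/08/10/europe/other-story/index.html">no keyword</a>
<a href="mailto:china@example.com">not a web link</a>
"""

for link in find_links(SAMPLE_HTML, "https://www.cnn.com/", "china"):
    print(link)
```

With a real page fetched through requests, you would likely pass response.text as html and response.url as page_url, so that links are resolved against the address the page was actually served from (including after redirects).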