Tags: python, beautifulsoup, urllib2

Getting URLs from a page and also from the next pages


I am trying to get all the URL links from a page. I am using this link:

https://www.horizont.net/suche/?OK=suchen&OK=suchen&i_sortfl=pubdate&i_sortd=desc&i_q=der

This link points to a search query that returns articles. There are about 9 articles on each page, and I would like to get all of the article URLs from the page as a list.

As a second step, once all the links have been extracted from the first page, the script should automatically open the second page and fetch all the links from there as well.


There are about 15,194 pages in total, so I would like to get the article hyperlinks from all of them.

So far I have tried this:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

def getLinks(url):
    # Download and parse the page
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page)
    links = []

    # Collect the href of every anchor whose href starts with http://
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    return links

print( getLinks("https://www.horizont.net/suche/?OK=suchen&OK=suchen&i_sortfl=pubdate&i_sortd=desc&i_q=der") )

The problem I am facing now is that I am getting every URL from the website, but I need only the ones that are search results, as well as the results from the next pages.


Solution

  • You can use the link's class attribute to pick out only the hrefs you actually need:

    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}, class_="ArticleTeaserSearchResultItem_link"):
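
    If you are on BeautifulSoup 4 (which the class_ keyword already requires), a CSS selector expresses the same filter in one call. This is only a sketch, assuming Python 3 and that ArticleTeaserSearchResultItem_link is indeed the class the site uses:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    url = "https://www.horizont.net/suche/?OK=suchen&OK=suchen&i_sortfl=pubdate&i_sortd=desc&i_q=der"
    soup = BeautifulSoup(urlopen(url), "html.parser")

    # select() takes a CSS selector; "a.ArticleTeaserSearchResultItem_link"
    # matches only anchors that carry the search-result class.
    links = [a.get("href") for a in soup.select("a.ArticleTeaserSearchResultItem_link")]
    print(links)

    Note that select() drops the href regex entirely, so it also keeps relative or https links that "^http://" would miss.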
    

    And if you want to browse all the pages and collect every article URL, I can suggest changing the currPage value in the link itself until the request stops succeeding:

    # The f-string below needs Python 3, so the Python-2-only urllib2
    # is swapped for urllib.request here.
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    def getLinks(url):
        html_page = urlopen(url)
        soup = BeautifulSoup(html_page, "html.parser")
        links = []

        for link in soup.findAll('a', attrs={'href': re.compile("^http://")}, class_="ArticleTeaserSearchResultItem_link"):
            links.append(link.get('href'))

        return links

    i = 1
    urls = []
    while True:
        url = f"https://www.horizont.net/suche/?OK=1&i_q=der&i_sortfl=pubdate&i_sortd=desc&currPage={i}"
        try:
            # extend() keeps urls flat; append() would nest one list per page
            urls.extend(getLinks(url))
        except Exception:
            break
        i += 1
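
    One caveat: if the server answers out-of-range page numbers with a normal page that simply contains no results, urlopen never raises and the loop above will not stop. A more defensive sketch, reusing the getLinks defined above and assuming an empty result page marks the end, breaks on the first page without result links:

    i = 1
    urls = []
    while True:
        url = f"https://www.horizont.net/suche/?OK=1&i_q=der&i_sortfl=pubdate&i_sortd=desc&currPage={i}"
        page_links = getLinks(url)
        if not page_links:  # no search-result links: assume we ran past the last page
            break
        urls.extend(page_links)
        i += 1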
    

    I haven't had the opportunity to debug this code right now, but I hope it helps. Good luck!