I am trying to get all the article links from a page. I am using this link:
https://www.horizont.net/suche/?OK=suchen&OK=suchen&i_sortfl=pubdate&i_sortd=desc&i_q=der
The link points to search results, and each results page shows about 9 articles. I would like to collect all the article URLs from the page as a list.
As a second step, once all the links from the first page have been extracted, the script should automatically open the second page and fetch all the links from there as well.
There are about 15194 pages in total, so I would like to get the article hyperlinks from all of them.
So far I have tried this:
from BeautifulSoup import BeautifulSoup
import urllib2
import re

def getLinks(url):
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page)
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

print(getLinks("https://www.horizont.net/suche/?OK=suchen&OK=suchen&i_sortfl=pubdate&i_sortd=desc&i_q=der"))
The problem I am facing now is that I am getting every URL on the website, but I only need the ones that are search results, plus the results from the following pages.
You can use the class attribute of the links you need to extract the href:

for link in soup.findAll('a', attrs={'href': re.compile("^http://")}, class_="ArticleTeaserSearchResultItem_link"):

Note that the class_ keyword argument requires the newer bs4 package; with the old BeautifulSoup 3 import in your code you would pass the class inside attrs instead, e.g. attrs={'class': "ArticleTeaserSearchResultItem_link"}. Also be aware that "^http://" will not match https:// links or relative hrefs, so you may want to drop that filter once you filter by class.
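To make the class filter concrete, here is a small self-contained sketch using the modern bs4 package. The sample HTML is invented for illustration; only the `ArticleTeaserSearchResultItem_link` class name comes from the answer above, and I have not verified it against the live page:

```python
from bs4 import BeautifulSoup

# Invented sample markup: one search-result link and one unrelated link.
html = """
<a class="ArticleTeaserSearchResultItem_link" href="https://www.horizont.net/article-1">Article 1</a>
<a class="nav" href="https://www.horizont.net/impressum">Impressum</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Filtering by class keeps only the search-result links and skips
# navigation, footer, and other unrelated anchors.
links = [a.get("href")
         for a in soup.find_all("a", class_="ArticleTeaserSearchResultItem_link")]

print(links)  # ['https://www.horizont.net/article-1']
```

With the class filter in place, the href regex from the original code becomes unnecessary, which also avoids silently dropping https:// and relative links.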
And if you are going to crawl all the pages and collect all the article URLs, I would advise changing the currPage value in the URL itself, incrementing it until a page no longer returns results:
# Note: the f-string below requires Python 3, so this uses
# urllib.request and bs4 rather than the Python 2 urllib2/BeautifulSoup
# imports from the question.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getLinks(url):
    html_page = urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []
    for link in soup.findAll('a', class_="ArticleTeaserSearchResultItem_link"):
        links.append(link.get('href'))
    return links

i = 1
urls = []
while True:
    url = f"https://www.horizont.net/suche/?OK=1&i_q=der&i_sortfl=pubdate&i_sortd=desc&currPage={i}"
    try:
        page_links = getLinks(url)
        if not page_links:   # an empty page means we are past the last results page
            break
        urls.extend(page_links)  # extend, not append, to get one flat list of URLs
    except Exception:
        break
    i += 1
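The loop above can be sketched and tested without network access by factoring the parsing step into its own function and simulating the paginated responses with inline HTML strings. In real use you would fetch each `currPage={i}` URL instead; the class name and the stop-on-empty-page condition are taken from the answer above, not verified against the live site:

```python
from bs4 import BeautifulSoup

def get_links(html):
    """Extract hrefs of search-result links from one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href")
            for a in soup.find_all("a", class_="ArticleTeaserSearchResultItem_link")]

# Simulated pages: two pages with results, then an empty page that ends the loop.
pages = [
    '<a class="ArticleTeaserSearchResultItem_link" href="/a1">1</a>'
    '<a class="ArticleTeaserSearchResultItem_link" href="/a2">2</a>',
    '<a class="ArticleTeaserSearchResultItem_link" href="/a3">3</a>',
    '<p>Keine Treffer</p>',
]

urls = []
for i, html in enumerate(pages, start=1):
    # In real code, fetch the page here instead, e.g.:
    # html = urllib.request.urlopen(f"https://www.horizont.net/suche/?...&currPage={i}").read()
    links = get_links(html)
    if not links:          # empty page: we are past the last page of results
        break
    urls.extend(links)     # extend keeps one flat list instead of a list of lists

print(urls)  # ['/a1', '/a2', '/a3']
```

Stopping on the first empty page is more reliable than waiting for an exception, since many sites return a valid but empty results page rather than an error once you run past the last page.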
I don't have the opportunity to debug my code right now, but I hope this helps. Good luck!