For a school project we need to scrape a job-finding website, store the results in a DB, and later match these profiles with companies that are looking for people.
On this particular site, all the URLs of the pages I need to scrape are in one div with the id 'primaryResults', which holds 10 links per results page.
With BeautifulSoup I want to first collect all the links into an array by looping through the page number in the URL until a 404 or something similar comes up.
Then go through each of these pages, and store the information I need from each page into an array and lastly send this to my DB.
I'm stuck at the part where I collect the 10 links from the div with id 'primaryResults'.
How would I write this in Python so that all 10 URLs end up in an array? So far I have tried this:
import urllib2
from BeautifulSoup import BeautifulSoup
opener = urllib2.build_opener()
opener.addheaders = [("User-Agent", "Mozilla/5.0")]
url = ("http://jobsearch.monsterboard.nl/browse/")
content = opener.open(url).read()
soup = BeautifulSoup(content)
soup.find(id="primaryResults")
print soup.find_all('a')
but this only gives an error:
Traceback (most recent call last):
print soup.find_all('a')
TypeError: 'NoneType' object is not callable
Could someone please help me out? Thanks :)
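The error occurs because "from BeautifulSoup import BeautifulSoup" loads BeautifulSoup 3, which has no find_all method (it is called findAll there). The unknown attribute is looked up as a tag name, which returns None, and calling None raises the TypeError. With bs4 installed, the following prints all the job links on the URL you mentioned: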
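from bs4 import BeautifulSoup
import urllib2

url = "http://jobsearch.monsterboard.nl/browse/"
page = urllib2.urlopen(url)
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(page.read(), "html.parser")
# The job links on this page carry the class 'slJobTitle'
jobs = soup.findAll('a', {'class': 'slJobTitle'})
for eachjob in jobs:
    # href holds the link target of each job entry
    print eachjob['href']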
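The question asks specifically for the links inside the div with id 'primaryResults'. The original attempt throws away the result of soup.find(id="primaryResults") and then searches the whole soup again. A minimal sketch of keeping that result and searching only inside it is below. It assumes Python 3 with bs4 installed; the sample HTML and the helper name links_in_results are made up for illustration, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for one results page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
  <a href="/login">Log in</a>
  <div id="primaryResults">
    <a class="slJobTitle" href="/job/1">Job one</a>
    <a class="slJobTitle" href="/job/2">Job two</a>
    <a href="/job/3">Job three</a>
  </div>
</body></html>
"""


def links_in_results(html, container_id="primaryResults"):
    """Return the href of every link inside the element with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        # No such element on this page, e.g. an empty or error page.
        return []
    # Search the container, not the whole soup, and skip anchors without href.
    return [a["href"] for a in container.find_all("a", href=True)]


result_links = links_in_results(SAMPLE_HTML)
print(result_links)  # ['/job/1', '/job/2', '/job/3']
```

The "/login" link is left out because it sits outside the container. If the hrefs on the real site are relative, urllib.parse.urljoin can turn them into absolute URLs before they are stored.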
Hope it is clear and helpful.
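For the other half of the question, looping over the page number until a 404 comes up, a rough sketch could look like the one below. It assumes Python 3, where urllib.request and urllib.error replace urllib2. The names fetch_page and collect_links, the url_template parameter with its {page} placeholder, and the example.invalid URLs are all hypothetical: the question does not say how the page number appears in the real URL, so that part has to be adapted to the site.

```python
import urllib.error
import urllib.request


def fetch_page(url):
    """Return the page body as bytes, or None when the server answers 404."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise  # any other HTTP error is unexpected, so let it surface


def collect_links(url_template, extract_links, fetch=fetch_page, max_pages=1000):
    """Walk pages 1, 2, 3, ... and gather links until a page is missing or empty."""
    all_links = []
    for page in range(1, max_pages + 1):
        html = fetch(url_template.format(page=page))
        if html is None:
            break  # 404: we ran past the last page
        links = extract_links(html)
        if not links:
            break  # some sites return an empty page rather than a 404
        all_links.extend(links)
    return all_links


# Offline demonstration: a dict plays the role of the website (hypothetical URLs).
fake_site = {
    "http://example.invalid/browse?page=1": "a b",
    "http://example.invalid/browse?page=2": "c",
}
demo_links = collect_links(
    "http://example.invalid/browse?page={page}",
    extract_links=str.split,  # stand-in for a real HTML link extractor
    fetch=fake_site.get,      # returns None for unknown pages, like a 404
)
print(demo_links)  # ['a', 'b', 'c']
```

Passing the fetch and extract functions in as arguments keeps the loop testable without network access. In real use, extract_links would be a function that parses the HTML with BeautifulSoup and returns the hrefs, and the max_pages cap guards against a site that never returns a 404.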