I'm trying to extract the job description for each posting on the Indeed website, but the result is not what I expected.
I've written some code to get the job descriptions. I'm working with Python 2.7 and the latest BeautifulSoup. When you open the page and click on a job title, the related information appears on the right side of the screen. I need to extract that description for each job on this page. My code:
import sys
import urllib2
from BeautifulSoup import BeautifulSoup
url = "https://www.indeed.com/jobs?q=construction%20manager&l=Houston%2C%20TX&vjk=8000b2656aae5c08"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
N = soup.findAll("div", {"id" : "vjs-desc"})
print N
I expected to see the job descriptions, but instead I got [] as the result. Is it because the id is non-unique? If so, how should I edit the code?
The #vjs-desc element is generated by JavaScript, and its content comes from an Ajax request. To get the descriptions, you need to make that request yourself.
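A quick way to convince yourself of this is to check whether the id appears anywhere in the markup the server sends. The sketch below is only an illustration: `has_element_id` is a hypothetical helper, and both HTML strings are made-up stand-ins for the real page, not actual Indeed markup.

```python
import re

def has_element_id(html, element_id):
    """Return True if the raw markup contains an id attribute with this value."""
    # (?<![\w-]) keeps attributes such as data-id from matching.
    pattern = r'''(?<![\w-])id\s*=\s*["']%s["']''' % re.escape(element_id)
    return re.search(pattern, html) is not None

# Hypothetical stand-in for what urllib2/requests receives (no JavaScript has run).
static_html = '<div id="resultsCol"><a class="jobtitle">Construction Manager</a></div>'
# Hypothetical stand-in for the DOM after the browser has filled in the side pane.
rendered_html = static_html + '<div id="vjs-desc">Job description text</div>'

print(has_element_id(static_html, "vjs-desc"))
print(has_element_id(rendered_html, "vjs-desc"))
```

If the same check on your downloaded `html` comes back false, the parser has nothing to find, regardless of whether the id is unique. The descriptions then have to come from the Ajax endpoint, as in the code that follows.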
# -*- coding: utf-8 -*-
# requests makes it easier to create HTTP requests and sessions
import requests
import re
import urllib
# the 'html.parser' argument used below is the bs4 API, so import from bs4
from bs4 import BeautifulSoup
url = "https://www......"
# create session
s = requests.session()
html = s.get(url).text
# extract job IDs
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)
# do Ajax request and convert the response to json
ajax_content = s.get(ajax_url).json()
print(ajax_content)
for job_id, desc in ajax_content.items():
    print job_id
    soup = BeautifulSoup(desc, 'html.parser')
    # or try this
    # soup = BeautifulSoup(desc.decode('unicode-escape'), 'html.parser')
    print soup.text.encode('utf-8')
    print('==============================')
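The ID extraction and URL building can be tried offline before hitting the site. This is a rough sketch under some assumptions: `build_jobdescs_url` is a hypothetical helper name, and `sample_html` is an invented fragment that only imitates the `jobKeysWithInfo['...']` pattern the regex above looks for. The real page may format those lines differently or may have changed since.

```python
import re

try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

# Same pattern as in the answer: capture the key inside jobKeysWithInfo['...'].
JOB_KEY_RE = re.compile(r"jobKeysWithInfo\['(.+?)'\]")

def build_jobdescs_url(html):
    """Collect every job key found in the page and build the Ajax URL."""
    job_ids = JOB_KEY_RE.findall(html)
    # quote() percent-encodes the commas that join the keys.
    return 'https://www.indeed.com/rpc/jobdescs?jks=' + quote(','.join(job_ids))

# Invented sample; the second key is a placeholder, not a real job ID.
sample_html = """
<script>
jobKeysWithInfo['8000b2656aae5c08'] = true;
jobKeysWithInfo['0123456789abcdef'] = true;
</script>
"""

print(build_jobdescs_url(sample_html))
```

If the printed URL ends with an empty `jks=` when you run this against the real page, the regex found no keys, which usually means the page layout differs from what this answer assumes.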