Problem with data extraction from Indeed by BeautifulSoup


I'm trying to extract the job description for each posting on the Indeed website, but the result is not what I expected.

I've written code to get the job descriptions. I'm working with Python 2.7 and the latest BeautifulSoup. When you open the page and click on a job title, the related information appears on the right side of the screen. I need to extract that job description for each job on the page. My code:

    import sys
    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = "https://www.indeed.com/jobs?q=construction%20manager&l=Houston%2C%20TX&vjk=8000b2656aae5c08"
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    N = soup.findAll("div", {"id" : "vjs-desc"})
    print N

I expected to see the job descriptions, but instead I got [] as the result. Is it because the id is non-unique? If so, how should I edit the code?


Solution

  • The #vjs-desc element is generated by JavaScript, and its content comes from an Ajax request, so it is not in the HTML that urllib2 downloads. To get the descriptions, you need to make that request yourself:

    # -*- coding: utf-8 -*-
    
    # requests makes it easier to create HTTP requests and sessions
    import re
    import urllib
    import requests
    # the 'html.parser' argument used below belongs to the bs4 (BeautifulSoup 4) API
    from bs4 import BeautifulSoup
    
    url = "https://www......"
    
    # create session
    s = requests.session()
    html = s.get(url).text
    
    # extract job IDs
    job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
    ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)
    # do the Ajax request and parse the JSON response
    ajax_content = s.get(ajax_url).json()
    print(ajax_content)
    
    # job_id avoids shadowing the built-in id()
    for job_id, desc in ajax_content.items():
        print(job_id)
        soup = BeautifulSoup(desc, 'html.parser')
        # or try this
        # soup = BeautifulSoup(desc.decode('unicode-escape'), 'html.parser')
        print(soup.text.encode('utf-8'))
        print('==============================')
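
The key-extraction and URL-building steps can be tried offline with the minimal sketch below. SAMPLE_HTML is a made-up fragment, not real Indeed output, and the helper names extract_job_keys and build_ajax_url are hypothetical. The regex and the jobdescs URL are taken from the answer above; whether that endpoint still responds this way is an assumption.

```python
import re

try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

# Hypothetical stand-in for the downloaded page: the job keys sit in a
# script block, and there is no element with id="vjs-desc" in the markup.
SAMPLE_HTML = """
<script>
jobKeysWithInfo['8000b2656aae5c08'] = true;
jobKeysWithInfo['aaaabbbbccccdddd'] = true;
</script>
<div id="vjs-container"></div>
"""


def extract_job_keys(html):
    """Return the keys found in jobKeysWithInfo['...'] assignments."""
    return re.findall(r"jobKeysWithInfo\['(.+?)'\]", html)


def build_ajax_url(job_keys):
    """Build the jobdescs URL from the answer for the given keys."""
    return 'https://www.indeed.com/rpc/jobdescs?jks=' + quote(','.join(job_keys))


keys = extract_job_keys(SAMPLE_HTML)
ajax_url = build_ajax_url(keys)

print('vjs-desc' in SAMPLE_HTML)  # False: a search of the static HTML finds nothing
print(keys)
print(ajax_url)
```

Because the description element is missing from the static markup, a search for it comes back empty, which matches the [] seen in the question. The comma that joins the keys is percent-encoded as %2C by quote.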