I'm trying to scrape a website for job description information, but I seem to be getting only unrelated text. Here's how I create the soup object:
import urllib.request
import bs4

url = 'https://www.glassdoor.com/Job/boston-full-stack-engineer-jobs-SRCH_IL.0,6_IC1154532_KO7,26.htm?jl=3188635682&guid=0000016a8432102e99e9b5232325d3d5&pos=102&src=GD_JOB_AD&srs=MY_JOBS&s=58&ao=599212'
req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
soup = bs4.BeautifulSoup(urllib.request.urlopen(req), "html.parser")
# Collect every div, li and ul, then print the ones that have a single string child
divliul = soup.body.find_all(['div', 'li', 'ul'])
for i in divliul:
    if i.string is not None:
        print(i.string)
If you browse the site for a moment, you'll see that the soup seems to contain only elements from the left-hand column and nothing from the job description containers. I thought this might be a urllib request issue, but I have also tried downloading the HTML file and reading it directly, and the results are similar. Output:
Jobs
Company Reviews
Company Reviews
Companies near you
Best Buy Reviews in Boston
Target Reviews in Boston
IBM Reviews in Boston
AT&T Reviews in Boston
The Home Depot Reviews in Boston
Walmart Reviews in Boston
Macy's Reviews in Boston
Microsoft Reviews in Boston
Deloitte Reviews in Boston
Amazon Reviews in Boston
Bank of America Reviews in Boston
Wells Fargo Reviews in Boston
Company Culture
Best Places to Work
12 Companies That Will Pay You to Travel the World
7 Types of Companies You Should Never Work For
20 Companies Hiring for the Best Jobs In America
How to Become the Candidate Recruiters Can’t Resist
13 Companies With Enviable Work From Home Options
New On Glassdoor
Salaries
Interviews
Salary Calculator
Account Settings
Account Settings
Account Settings
Account Settings
empty notification btn
My Profile
Saved Jobs
Email & Alerts
Contributions
My Resumes
Company Follows
Account
Help / Contact Us
Account Settings
Account Settings
Account Settings
empty notification btn
For Employers
For Employers
Unlock Employer Account
Unlock Employer Account
Post a Job
Post a Job
Employer Branding
Job Advertising
Employer Blog
Talk to Sales
Post Jobs Free
Full Stack Engineer Jobs in Boston, MA
Jobs
Companies
Salaries
Interviews
Full Stack Engineer
EASY APPLY
EASY APPLY
Full Stack Engineer | Noodle.com
EASY APPLY
EASY APPLY
Full Stack Engineer
Hot
Software Engineer
EASY APPLY
EASY APPLY
Senior Software Engineer
EASY APPLY
EASY APPLY
We're Hiring
We're Hiring
Full Stack Engineer
Hot
Software Engineer
Hot
Hot
Full Stack Engineer
We're Hiring
Full Stack Software Engineer
EASY APPLY
EASY APPLY
We're Hiring
We're Hiring
Software Engineer
New
New
Full Stack Engineer
EASY APPLY
EASY APPLY
We're Hiring
We're Hiring
Pre-Sales Engineer / Full-Stack Developer
Top Company
Top Company
Full Stack Software Engineer
Software Engineer
Top Company
Top Company
Associate Software Engineer
Full Stack Software Engineer
Software Engineer
New
New
Mid-level Full Stack Software Engineer (Java/React
EASY APPLY
EASY APPLY
Junior Software Engineer - Infrastructure
Software Engineer
Software Engineer
New
New
Associate Software Engineer
C# Engineer - Full Stack
EASY APPLY
EASY APPLY
Software Engineer, Platform
Software Engineer
EASY APPLY
EASY APPLY
Software Engineer
Associate Software Engineer
Software Engineer
Software Engineer
Software Engineer - Features
EASY APPLY
EASY APPLY
Page 1 of 81
Previous
1
2
3
4
5
Next
People Also Searched
Top Cities for Full Stack Engineer:
Top Companies for full stack engineer in Boston, MA:
Help / Contact Us
Terms of Use
Privacy & Cookies (New)
Copyright © 2008–2019, Glassdoor, Inc. "Glassdoor" and logo are proprietary trademarks of Glassdoor, Inc.
Email me jobs for:
Create a Job Alert
Your job alert has been created.
Create more job alerts for related jobs with one click:
There are ids you can extract from that page and concatenate into a URL that the page itself uses to retrieve the JSON which populates the card on the right as you scroll. Handle the JSON to extract whatever info you want.
Finding the URLs: the right-hand side updates its content as you scroll down on the left, so I went hunting in the browser's network tab for the activity associated with the update. The new URLs generated during scrolling shared common strings, with some parts that varied, i.e. likely a query string format. I guessed that the parts that varied came from the page (some of the others looked like generated ids we could keep static or ignore, an experience-based assumption that I tested). I then searched the HTML for what I expected were the important identifiers that differentiate jobs to the server, i.e. the two sets of ids. Take either of the two ids being concatenated into the URL string in the network tab and press Ctrl + F to search the page HTML for it; you will see where these values come from.
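To make the id hunt concrete, here is a minimal offline sketch of the same idea using only the standard library. The `SAMPLE_HTML` snippet, the id values in its list items, the `example.invalid` template and the helper names are all made up for illustration; the real page is far larger and its markup may differ. It collects the `data-ad-order-id` attributes, pulls the 10-digit job ids out of the inline `jobIds` list, and pairs them up by position.

```python
import re
from html.parser import HTMLParser

# Hypothetical, heavily trimmed stand-in for the search results page.
SAMPLE_HTML = """
<ul>
  <li class="jl" data-ad-order-id="599212">Full Stack Engineer</li>
  <li class="jl" data-ad-order-id="604321">Software Engineer</li>
</ul>
<script>
  var tracking = {'jobIds':[3188635682,3201456789],'segmentType':'search'};
</script>
"""


class AdOrderIdCollector(HTMLParser):
    """Collect every data-ad-order-id attribute value in document order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-ad-order-id":
                self.ids.append(value)


def extract_id_pairs(html):
    """Return (ad order id, job listing id) pairs, matched by position."""
    collector = AdOrderIdCollector()
    collector.feed(html)
    # Grab the contents of the inline jobIds list, then the 10-digit ids in it.
    match = re.search(r"jobIds':\[(.*?)\]", html, re.DOTALL)
    job_ids = re.findall(r"\d{10}", match.group(1)) if match else []
    return list(zip(collector.ids, job_ids))


pairs = extract_id_pairs(SAMPLE_HTML)

# Placeholder template: the real one is whatever URL the network tab shows.
template = "https://example.invalid/details?ao={}&jobListingId={}"
urls = [template.format(ao, job_id) for ao, job_id in pairs]
print(urls)
```

Pairing by position assumes both lists appear in the same order on the page, which is worth checking against a request or two from the network tab.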
from bs4 import BeautifulSoup as bs
import requests
import re

results = []

with requests.Session() as s:
    # JSON endpoint seen in the network tab; the two {} slots take the ad order id and the job listing id
    url = 'https://www.glassdoor.co.uk/Job/json/details.htm?pos=&ao={}&s=58&guid=0000016a88f962649d396c5b606d567b&src=GD_JOB_AD&t=SR&extid=1&exst=OL&ist=&ast=OL&vt=w&slr=true&cs=1_1d8f42ad&cb=1557076206569&jobListingId={}&gdToken=uo8hehXn6nNuwhjMyBW14w:3RBFWgOD-0e7hK8o-Fgo0bUtD6jw5wJ3UujVq6L-v0ux9mlLjMxjW8-KF9xsDk41j7I11QHOHgcj9LBoWYaCxg:wAFOqHzOjgAxIGQVmbyibsaECrQO-HWfxb8Ugq-x_tU'
    headers = {'User-Agent' : 'Mozilla/5.0'}
    # Fetch the search results page that holds both sets of ids
    r = s.get('https://www.glassdoor.co.uk/Job/boston-full-stack-engineer-jobs-SRCH_IL.0,6_IC1154532_KO7,26.htm?jl=3188635682&s=58&pos=102&src=GD_JOB_AD&srs=MY_JOBS&guid=0000016a8432102e99e9b5232325d3d5&ao=599212&countryRedirect=true', headers = headers)
    soup = bs(r.content, 'lxml')
    # First set of ids: the data-ad-order-id attribute on each listing
    ids = [item['data-ad-order-id'] for item in soup.select('[data-ad-order-id]')]
    # Second set of ids: the 10-digit job ids in the inline jobIds list
    p1 = re.compile(r"jobIds':\[(.*)'segmentType'", re.DOTALL)
    init = p1.findall(r.text)[0]
    p2 = re.compile(r"(\d{10})")
    job_ids = p2.findall(init)
    loop_var = list(zip(ids, job_ids))

    # Request the JSON for each listing and keep the parsed response
    for x, y in loop_var:
        data = s.get(url.format(x,y), headers = headers).json()
        results.append(data)
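For the "handle the JSON" step, the layout of each response is not shown here, so the following is only a rough sketch under assumptions: the key name `description` and the shape of `sample_results` are hypothetical stand-ins for whatever the real responses in `results` contain. A small recursive walker lets you pull out a field without knowing how deeply it is nested.

```python
def find_values(node, wanted_key):
    """Yield every value stored under wanted_key anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == wanted_key:
                yield value
            # Keep descending in case the same key also appears deeper down.
            yield from find_values(value, wanted_key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, wanted_key)


# Hypothetical response shape, used only so the sketch runs offline.
sample_results = [
    {"job": {"title": "Full Stack Engineer", "description": "<p>Build things.</p>"}},
    {"job": {"title": "Software Engineer", "description": "<p>Ship things.</p>"}},
]

descriptions = list(find_values(sample_results, "description"))
print(descriptions)
```

Printing one element of `results` first and reading its actual keys is the reliable way to decide what to pass as `wanted_key`.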