I am new to Python and hope someone here can help. As part of my learning, I am building a program to scrape information from LinkedIn job adverts. So far it has gone well, but I have hit a brick wall with one particular issue. I am attempting to scrape the full job description, including the qualifications. I have identified the XPath for the description and can reference it as follows:
desc_xpath = '/html/body/main/section/div[2]/section[2]/div'
This gives me nearly all of the job description, but it does not include the qualifications section of a LinkedIn job posting. I get the high-level, wordy part of each job profile, but the further drill-downs such as responsibilities, qualifications and extra qualifications do not seem to be pulled by this reference.
Is anybody able to help?
Kind regards
D
Example Code
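import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome; any Selenium driver works
driver.get('https://www.linkedin.com/jobs/view/etl-developer-at-barclays-2376164866/?utm_campaign=google_jobs_apply&utm_source=google_jobs_apply&utm_medium=organic&originalSubdomain=uk')
time.sleep(3)  # crude wait for the page to render
# Job description
jobdesc_xpath = '/html/body/main/section[1]/section[3]/div/section/div'
job_descs = driver.find_element(By.XPATH, jobdesc_xpath).text
print(job_descs)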
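Selenium's .text property returns only the text that is visibly rendered, so it can miss text in nested sub-tags that the page keeps collapsed. You could try an HTML parser, such as BeautifulSoup, on the element's HTML instead. Try this: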
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

url = 'https://www.linkedin.com/jobs/view/etl-developer-at-barclays-2376164866/?utm_campaign=google_jobs_apply&utm_source=google_jobs_apply&utm_medium=organic&originalSubdomain=uk'
driver.get(url)
# Find the job description
job_desc = driver.find_element(By.XPATH, '//div[@class="show-more-less-html__markup show-more-less-html__markup--clamp-after-5"]')
# Get the HTML of the element and pass it to the BeautifulSoup parser
soup = BeautifulSoup(job_desc.get_attribute('outerHTML'), 'html.parser')
# By default the parser runs all paragraphs together on one line. Use separator='\n' to put each paragraph on a new line, or '\n\n' to leave an empty line between paragraphs
print(soup.get_text(separator='\n\n'))
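
If you want to see why parsing the element's HTML picks up the nested sections, here is a minimal sketch that uses only Python's built-in html.parser instead of BeautifulSoup, so it runs without a browser or extra packages. The sample markup, the TextCollector class and the html_to_text helper are all made up for illustration; the real LinkedIn markup will differ, and this only roughly imitates what get_text(separator=...) does.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every non-empty text node, however deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once per run of text between tags
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html, separator='\n\n'):
    # Hypothetical helper: joins the text nodes with the given separator
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return separator.join(collector.chunks)


# Made-up markup standing in for the job description's outerHTML
sample_html = (
    '<div class="description">'
    '<p>We are hiring an ETL developer.</p>'
    '<strong>Qualifications</strong>'
    '<ul><li>SQL</li><li>Python</li></ul>'
    '</div>'
)

print(html_to_text(sample_html, separator='\n'))
```

Because the parser walks the markup itself rather than what the browser renders, the text inside the strong and li tags is collected along with the paragraph text.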