Search code examples
pythonweb-scrapingbeautifulsouphref

Beautifulsoup webscraping - extract values from within <a> tags


I am using bs4 to scrape indeed.com for jobs (link here). I have been able to successfully extract the title, company and summary without any issues.

Now I want to take this a step further and extract the sublink provided when the user clicks on each of the job titles, e.g. this link, so I can extract additional information such as specific technologies required for each role.

I noticed that the sublink url contains jk=e5b27ae62cb7c45f, which is the job key contained in the html that I can use to create the sublink. Unfortunately, I am struggling to extract this job key!

I have identified the job key is nested in an <a> tag, labeled data-jk:

enter image description here

Here is my function so far using bs4:

def transform(soup):
    divs = soup.find_all('a', class_= 'tapItem')
    
    for item in divs:
        title = item.find('h2', {'class':'jobTitle-color-purple'}).text
        id = item.find('a')
        
        print(title)
        print(id)
        
transform(soup)

Which returns the following results:

newDevOps Engineer
<a class="turnstileLink companyOverviewLink" data-tn-element="companyName" href="/cmp/Tata-Consultancy-Services-(tcs)" rel="noopener" target="_blank">Tata Consultancy Services (TCS)</a>
newDevOps Engineer - Sydney, Australia
None
DevOps Engineers
<a class="turnstileLink companyOverviewLink" data-tn-element="companyName" href="/cmp/CGI" rel="noopener" target="_blank">CGI</a>
Graduate Software Developer/Programmer (DevOps)
<a class="turnstileLink companyOverviewLink" data-tn-element="companyName" href="/cmp/Tata-Consultancy-Services-(tcs)" rel="noopener" target="_blank">Tata Consultancy Services (TCS)</a>
Cloud Engineer (Entry level, AWS training provided)

As you can see, I am able to extract title successfully but not id, since I do not know how to select the data-jk value from within the <a> tag. I am also confused as to why the <a> tag with class: 'tapItem' does not even appear when I call item.find('a')?

I've scoured stackoverflow but am unable to find a similar question to mine. Hoping someone here can help me figure this out!


Solution

  • title doesn't have directly href but all offer is inside <a> which I can even see on your image - at the top of black background (with target="_blank")

    You get job_seen_beacon which is also inside this <a> so you can't access this <a>. If you start few tags above then you can get <a> and href`

    #divs = soup.find_all('div', class_ = 'job_seen_beacon')
    
    divs = soup.find('div', {'id': 'mosaic-provider-jobcards'}).find_all('a', {'class': 'result'})
    
    for item in divs:
    
        link = item['href']
    

    Full working example

    import requests
    from bs4 import BeautifulSoup
    
    #extract
    def extract(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'}
        r = requests.get(url, headers)
        soup = BeautifulSoup(r.content, 'html.parser')
        return soup
        
    #transform
    def transform(soup):
        #divs = soup.find_all('div', class_ = 'job_seen_beacon')
        divs = soup.find('div', {'id': 'mosaic-provider-jobcards'}).find_all('a', {'class': 'result'})
    
        joblist = []
       
        for item in divs:
            title = item.find('h2', {'class':'jobTitle-color-purple'}).text
            company = item.find('span', {'class': 'companyName'}).text
            summary = item.find('div', {'class': 'job-snippet'}).text.replace('\n','')
    
            link = item['href']
            print(link)
    
            job = {
                'title': title,
                'company': company,
                'summary': summary,
                'link': link,
            }
            joblist.append(job)
            #print(job)
            print('---')
            
        return joblist
    
    soup = extract('https://uk.indeed.com/jobs?q=devops&start=0')
    joblist = transform(soup)
    
    #print(joblist)