python html web-scraping beautifulsoup linkedin-api

LinkedIn scraping not getting all data


From a LinkedIn company page like: https://www.linkedin.com/company/10073529?trk=tyah&trkInfo=clickedVertical%3Acompany%2CclickedEntityId%3A10073529%2Cidx%3A1-1-1%2CtarId%3A1461132316737%2Ctas%3Adastrong%20

I am trying to retrieve the link associated with data-li-miniprofile-id in elements like

<a class="new-miniprofile-container" href="..." data-li-url="..." data-li-miniprofile-id="...">

which is nested several levels deep inside other elements.

This is what my code looks like so far:

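import requests
from bs4 import BeautifulSoup  # the class name is case-sensitive

# Company page URL from above (tracking query parameters omitted)
url = "https://www.linkedin.com/company/10073529"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
# Print the href of every anchor present in the returned HTML
for link in soup.find_all("a"):
    print(link.get('href'))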

I initially searched only for a elements with class="new-miniprofile-container", but that returned an empty result. I think the reason is that the output of soup.prettify() (which shows all of the scraped HTML) simply does not contain the nested content where those links should appear.
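To separate a selector problem from a missing-content problem, the lookup can be tried on static markup first. The following is only a minimal sketch: the helper name miniprofile_links and both HTML snippets are hypothetical, shaped after the element described above, and are not taken from a real LinkedIn page.

```python
from bs4 import BeautifulSoup


def miniprofile_links(html):
    """Return (miniprofile id, href) pairs for matching anchors in html."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ (with trailing underscore) filters on the CSS class attribute
    anchors = soup.find_all("a", class_="new-miniprofile-container")
    return [(a.get("data-li-miniprofile-id"), a.get("href")) for a in anchors]


# Hypothetical markup that contains the anchor, nested a few levels deep.
WITH_ANCHOR = (
    '<div><ul><li>'
    '<a class="new-miniprofile-container" href="/profile/view?id=1" '
    'data-li-url="/miniprofile/1" data-li-miniprofile-id="1">Example</a>'
    '</li></ul></div>'
)
# Hypothetical markup for the same shell without the nested content.
WITHOUT_ANCHOR = '<div><ul></ul></div>'

print(miniprofile_links(WITH_ANCHOR))     # [('1', '/profile/view?id=1')]
print(miniprofile_links(WITHOUT_ANCHOR))  # []
```

If the selector works on markup like the first snippet but the scraped page behaves like the second, the search itself is fine and the anchors are simply absent from the HTML that requests received.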

I suspect the problem is related to protections LinkedIn has put in place against scraping, but I want to know whether there is a way to get those URLs, or whether there are other options for obtaining them.


Solution

  • You should be using the LinkedIn REST API instead. It has endpoints for company profiles, and you can experiment with them in the REST API explorer. There is also a python-linkedin client, whose documentation covers the Company API.