From a LinkedIn page like: https://www.linkedin.com/company/10073529?trk=tyah&trkInfo=clickedVertical%3Acompany%2CclickedEntityId%3A10073529%2Cidx%3A1-1-1%2CtarId%3A1461132316737%2Ctas%3Adastrong%20
I am trying to retrieve the link associated with data-li-miniprofile-id:
<a class="new-miniprofile-container" href="..." data-li-url="..." data-li-miniprofile-id="...">
which is nested several levels deep inside other parent elements.
This is what my code looks like so far:
import requests
from bs4 import BeautifulSoup

# Company page from the question (tracking parameters omitted)
url = "https://www.linkedin.com/company/10073529"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
# Print the href of every anchor found in the returned HTML
for link in soup.find_all("a"):
    print(link.get('href'))
I initially searched only for a class="new-miniprofile-container", but that returned an empty list. I think the reason is that the output of soup.prettify() (which returns all of the scraped HTML) simply does not contain the nested content in which those links should appear.
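One way to separate a selector problem from a missing-content problem is to run the class-based search against a small hand-written snippet first. The sketch below uses only the standard library's html.parser so that it runs without extra packages. The sample markup, the attribute values, and the MiniProfileLinkParser name are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class MiniProfileLinkParser(HTMLParser):
    """Collect href and data-li-miniprofile-id from matching <a> tags."""

    def __init__(self, wanted_class="new-miniprofile-container"):
        super().__init__()
        self.wanted_class = wanted_class
        self.links = []  # list of (href, miniprofile_id) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes:
            self.links.append(
                (attr_map.get("href"), attr_map.get("data-li-miniprofile-id"))
            )


# Hypothetical sample markup, standing in for response.text.
sample_html = """
<div><ul><li>
  <a class="new-miniprofile-container" href="/profile/view?id=1"
     data-li-url="/profile/mini?id=1" data-li-miniprofile-id="id-1">A</a>
</li><li>
  <a class="other" href="/about">About</a>
</li></ul></div>
"""

parser = MiniProfileLinkParser()
parser.feed(sample_html)
print(parser.links)
```

If the same parser finds nothing when fed the real response.text, the anchors are most likely absent from the HTML the server returned, and no change on the parsing side would recover them. With BeautifulSoup, the equivalent search would be soup.find_all("a", class_="new-miniprofile-container").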
I suspect the problem is caused by protections put in place by LinkedIn's engineers, but I would like to know whether there is a way to get those URLs, or whether there are other options for obtaining them.
You should use the LinkedIn REST API instead. It provides endpoints for company profiles, and you can experiment with them in the REST API explorer. There is also a python-linkedin client, whose documentation covers the Company API.