I am trying to scrape the content of this page: https://public.era.nih.gov/pubroster/roster.era?CID=102353. I am able to do it for the entries from ANANDASABAPATHY onward, which are each contained inside a "p" tag:
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("p")
and then filtering on the number of "br" tags inside each element:
for bullet in column:
    if len(bullet.find_all("br")) == 4:
        person = {}
        person["NAME"] = bullet.contents[0].strip()
        person["PROFESSION"] = bullet.contents[2].strip()
        person["DEPARTMENT"] = bullet.contents[4].strip()
        person["INSTITUTION"] = bullet.contents[6].strip()
        person["LOCATION"] = bullet.contents[8].strip()
However, I have 2 issues.
soup.find("b").findNext('br').findNext('br').findNext('br').contents[0].strip()
but it is not working.
Any help would be extremely useful! Thanks in advance!
This is a case where it may be easier to process the data as plain text rather than as HTML, after initially extracting the element you're looking for. The reason is that the HTML is poorly formatted for parsing and doesn't follow a uniform pattern. The html5lib package generally handles poorly formatted HTML better than html.parser, but it didn't help significantly in this case.
import re
from typing import Collection, Iterator

from bs4 import BeautifulSoup


def iter_lines(soup: BeautifulSoup, ignore: Collection[str] = ()) -> Iterator[str]:
    for sibling in soup.find('b').next_siblings:
        for block in sibling.stripped_strings:
            block_str = ' '.join(filter(None, (line.strip() for line in block.split('\n'))))
            if block_str and block_str not in ignore:
                yield block_str


def group_people(soup: BeautifulSoup, ignore: Collection[str] = ()) -> list[list[str]]:
    zip_code_pattern = re.compile(r', \d+$')
    people = []
    person = []
    for line in iter_lines(soup, ignore):
        person.append(line)
        if zip_code_pattern.search(line):
            people.append(person)
            person = []
    return people


def normalize_person(raw_person: list[str]) -> dict[str, str | None]:
    return {
        'NAME': raw_person[0],
        'PROFESSION': raw_person[1] if len(raw_person) > 4 else None,
        'DEPARTMENT': next((line for line in raw_person if 'DEPARTMENT' in line), None),
        'INSTITUTION': raw_person[-2],
        'LOCATION': raw_person[-1],
    }


raw_people = group_people(soup, ignore={'SCIENTIFIC REVIEW OFFICER'})
normalized = [normalize_person(person) for person in raw_people]
This works with both BeautifulSoup(content, 'html.parser') and BeautifulSoup(content, 'html5lib').
The iter_lines function finds the first <b> tag like you did before, and then yields a single string for each line that is displayed in a browser.
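To illustrate the whitespace handling inside iter_lines in isolation, here is a minimal sketch. The helper name collapse_block and the sample text are made up for this illustration and do not come from the page; the expression itself is the one used above.

```python
def collapse_block(block: str) -> str:
    """Join the non-empty, stripped lines of a text block with single spaces."""
    return ' '.join(filter(None, (line.strip() for line in block.split('\n'))))


# Hypothetical text node: one visible line that is split and indented in the HTML source.
raw = "DEPARTMENT OF BIOLOGY\n          AND CHEMISTRY\n"
print(collapse_block(raw))  # DEPARTMENT OF BIOLOGY AND CHEMISTRY
```

This is why a value that wraps across several source lines still comes out as one string.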
The group_people function groups the lines into separate people, using the zip code at the end of a line to indicate that that person's entry is complete. It may be possible to combine this function with iter_lines and skip the regex, but this was slightly easier. Better formatted HTML would be more conducive to that approach.
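The grouping step can be tried without any HTML at all. The sketch below applies the same pattern to a plain list of strings; the function name group_records and the sample lines are invented for illustration, and the locations are deliberately shaped to end in a comma, a space, and digits, which is what the pattern expects.

```python
import re

ZIP_AT_END = re.compile(r', \d+$')  # same pattern as in group_people


def group_records(lines):
    """Collect lines into records, closing a record when a line ends in a zip code."""
    records, current = [], []
    for line in lines:
        current.append(line)
        if ZIP_AT_END.search(line):
            records.append(current)
            current = []
    return records


# Invented sample lines, not taken from the real page.
lines = [
    'DOE, JANE, PHD', 'PROFESSOR', 'DEPARTMENT OF BIOLOGY',
    'EXAMPLE UNIVERSITY', 'SPRINGFIELD, IL, 62701',
    'ROE, RICHARD, MD', 'EXAMPLE INSTITUTE', 'SHELBYVILLE, IL, 62565',
]
groups = group_records(lines)
print([len(group) for group in groups])  # [5, 3]
```

Note that any lines after the last zip code never close a record, so they are silently dropped. If a page ends with trailing text, that is probably what you want, but it is worth keeping in mind.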
The ignore parameter is used to skip the SCIENTIFIC REVIEW OFFICER header above the last person on that page.
Lastly, the normalize_person function attempts to interpret what each line for a given person means. The name, institution, and location appear to be fairly consistent, but I took some liberties with profession and department, using None when it seemed like a value did not exist. Those decisions were based only on the particular page you linked to, so you may need to adjust them for other pages. It uses negative indexes for the institution and location because the number of lines for each person varies.
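To see how those heuristics behave on entries of different lengths, here is a small self-contained check. It restates the same rules under the hypothetical name normalize, and all of the records are invented placeholders rather than data from the page.

```python
from typing import Dict, List, Optional


def normalize(record: List[str]) -> Dict[str, Optional[str]]:
    """Same rules as normalize_person, repeated here so the sketch runs on its own."""
    return {
        'NAME': record[0],
        'PROFESSION': record[1] if len(record) > 4 else None,
        'DEPARTMENT': next((line for line in record if 'DEPARTMENT' in line), None),
        'INSTITUTION': record[-2],
        'LOCATION': record[-1],
    }


# Invented records with 5, 4, and 3 lines.
full = ['DOE, JANE, PHD', 'PROFESSOR', 'DEPARTMENT OF BIOLOGY',
        'EXAMPLE UNIVERSITY', 'SPRINGFIELD, IL, 62701']
no_profession = ['POE, PAT, PHD', 'DEPARTMENT OF CHEMISTRY',
                 'EXAMPLE COLLEGE', 'CAPITAL CITY, IL, 62702']
minimal = ['ROE, RICHARD, MD', 'EXAMPLE INSTITUTE', 'SHELBYVILLE, IL, 62565']

for record in (full, no_profession, minimal):
    print(normalize(record))
```

One limitation to be aware of: a four-line entry that has a profession but no department would get PROFESSION set to None, because the length check only treats the second line as a profession when there are more than four lines. If other pages contain entries like that, the rule would need adjusting.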