Tags: python, web-scraping, beautifulsoup

Python - Scraping text between <br> tags that is not inside a <p>


I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/roster.era?CID=102353 and I am able to do it for the names beginning with ANANDASABAPATHY, which are contained inside a "p" tag:

from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://public.era.nih.gov/pubroster/roster.era?CID=102353"
driver = webdriver.Chrome()  # any Selenium WebDriver works here
driver.get(url)

content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content, "html.parser")

column = soup.find_all("p")

and then filtering the paragraphs by how many <br> tags they contain:

for bullet in column:
    # A complete entry has 4 <br> tags separating 5 lines of text; the text
    # nodes sit at the even indexes of .contents, the <br> tags at the odd ones.
    if len(bullet.find_all("br")) == 4:
        person = {}
        person["NAME"] = bullet.contents[0].strip()
        person["PROFESSION"] = bullet.contents[2].strip()
        person["DEPARTMENT"] = bullet.contents[4].strip()
        person["INSTITUTION"] = bullet.contents[6].strip()
        person["LOCATION"] = bullet.contents[8].strip()

However, I have 2 issues.

  1. I am unable to scrape the information for the chairperson (GUDJONSSON), whose entry is not contained inside a "p" tag. I was trying something like:
soup.find("b").findNext('br').findNext('br').findNext('br').contents[0].strip()

but it is not working.

  2. I am unable to differentiate between the last 2 persons (WONDRAK and GERSCH) because they are both contained inside the same "p" tag.

Any help would be extremely useful! Thanks in advance!


Solution

  • This is a case where it may be easier to process the data as plain text rather than as HTML, after initially extracting the element you're looking for. The reason is that the HTML is not well formatted for parsing: it doesn't follow a uniform pattern. The html5lib package generally handles poorly formatted HTML better than html.parser, but it didn't help significantly in this case.

    import re
    from typing import Collection, Iterator
    
    from bs4 import BeautifulSoup
    
    
    def iter_lines(soup: BeautifulSoup, ignore: Collection[str] = ()) -> Iterator[str]:
        # Walk everything after the first <b> tag and yield one clean string per
        # displayed line, skipping any strings listed in `ignore`.
        for sibling in soup.find('b').next_siblings:
            for block in sibling.stripped_strings:
                # Collapse internal newlines and indentation into single spaces.
                block_str = ' '.join(filter(None, (line.strip() for line in block.split('\n'))))
                if block_str and block_str not in ignore:
                    yield block_str
    
    
    def group_people(soup: BeautifulSoup, ignore: Collection[str] = ()) -> list[list[str]]:
        # A line ending in ", <digits>" is the zip-code line that closes an entry.
        zip_code_pattern = re.compile(r', \d+$')
        people = []
        person = []
        for line in iter_lines(soup, ignore):
            person.append(line)
            if zip_code_pattern.search(line):
                people.append(person)
                person = []
    
        return people
    
    
    def normalize_person(raw_person: list[str]) -> dict[str, str | None]:
        # Name is always first; institution and location are always the last two.
        # A separate profession line only exists when there are more than 4 lines.
        return {
            'NAME': raw_person[0],
            'PROFESSION': raw_person[1] if len(raw_person) > 4 else None,
            'DEPARTMENT': next((line for line in raw_person if 'DEPARTMENT' in line), None),
            'INSTITUTION': raw_person[-2],
            'LOCATION': raw_person[-1],
        }
    
    
    raw_people = group_people(soup, ignore={'SCIENTIFIC REVIEW OFFICER'})
    normalized = [normalize_person(person) for person in raw_people]
    

    This works with both BeautifulSoup(content, 'html.parser') and BeautifulSoup(content, 'html5lib').

    The iter_lines function finds the first <b> tag like you did before, and then yields a single string for each line that is displayed in a browser.
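
    For example, you can inspect the raw lines directly before any grouping. The exact output depends on the live page, so the shape sketched in the comments is only an illustration:

    for line in iter_lines(soup, ignore={'SCIENTIFIC REVIEW OFFICER'}):
        print(line)

    # Each printed string corresponds to one displayed line: roughly a name
    # line, optional profession and department lines, an institution line,
    # and finally a "CITY, STATE, ZIP" line that ends that person's entry.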

    The group_people function groups the lines into separate people, using the zip code at the end of each address to mark that person's entry as complete. It may be possible to combine this function with iter_lines and skip the regex, but this was slightly easier; better-formatted HTML would be more conducive to that approach.
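
    As a quick sanity check of that delimiter, the pattern only matches lines ending in a comma, a space, and digits. The sample strings below are invented to show the shape, not copied from the page:

    # ', \d+$' matches only the zip-code line, which closes an entry.
    zip_code_pattern = re.compile(r', \d+$')

    assert zip_code_pattern.search('EXAMPLE CITY, ST, 12345')        # matches: ends an entry
    assert not zip_code_pattern.search('DEPARTMENT OF DERMATOLOGY')  # no match: entry continues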

    The ignore parameter was used to skip the SCIENTIFIC REVIEW OFFICER header above the last person on that page.

    Lastly, the normalize_person function attempts to interpret what each line for a given person means. The name, institution, and location appear to be fairly consistent, but I took some liberties with profession and department, using None when a value did not seem to exist. Those decisions were based only on the particular page you linked to - you may need to adjust them for other pages. It uses negative indexes for the institution and location because the number of lines in each person's data was variable.
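
    As an illustration, a made-up five-line entry (not copied from the page) normalizes like this; a four-line entry with no profession line would instead get PROFESSION set to None:

    raw = ['DOE, JANE, MD', 'PROFESSOR', 'DEPARTMENT OF DERMATOLOGY',
           'EXAMPLE UNIVERSITY', 'EXAMPLE CITY, ST, 12345']
    print(normalize_person(raw))
    # {'NAME': 'DOE, JANE, MD', 'PROFESSION': 'PROFESSOR',
    #  'DEPARTMENT': 'DEPARTMENT OF DERMATOLOGY',
    #  'INSTITUTION': 'EXAMPLE UNIVERSITY',
    #  'LOCATION': 'EXAMPLE CITY, ST, 12345'}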