python selenium-webdriver web-scraping beautifulsoup

Python - Scraping and classifying text in "fonts"


I would like to scrape the content of this website https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283 and create a table with the columns NAME, TITLE, LOCATION. I know some individuals have more or less "lines", but I am just trying to understand how I could even classify the first 3 lines for each person given that the text is in between "fonts" for all.

So far I have:

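 from bs4 import BeautifulSoup
 from selenium import webdriver

 driver = webdriver.Chrome() # assumption: the original does not show how driver was created
 url="https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283"
 driver.maximize_window()
 driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
 driver.get(url)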

 content = driver.page_source.encode('utf-8').strip()
 soup = BeautifulSoup(content,"html.parser")

 column = soup.find_all("font")

But once I am there and I have all the text within "font" in my "column" variable, I don't know how to proceed to differentiate between each person and build a loop where I would retrieve name, title, location etc. for each.

Any help would be highly appreciated!


Solution

  • Note: instead of using selenium, I simply fetched and parsed with soup = BeautifulSoup(requests.get(url).content, "html.parser"); as far as I can tell, the required section is not dynamic, so it shouldn't cause any issues.


    would you have any idea about how to look for pairs of <br>

    Since they represent empty lines, you could try simply splitting the text in that cell by \n\n\n

    blockText = soup.select_one('td:has(font)').get_text(' ')
    blockText = blockText.replace('-'*10, '\n\n\n') # pad "underlined" lines
    blockSections = [sect.strip() for sect in '\n'.join([
        l.strip('-').strip() for l in blockText.splitlines()
    ]).split('\n\n\n') if sect.strip()]
    

    Although, if you looked at blockSections, you might notice that some headers [ROSTER and MEMBERS] get stuck to the end of the previous section - probably because their formatting means that an extra <br> is not needed to distinguish them from their adjacent sections. [I added the .replace('-'*10, '\n\n\n') line so that at least they're separated from the next section.]

    Another risk is that I don't know whether all versions and parsers will render <br><br> as 3 line breaks in the text - some omit the <br> whitespace entirely, and others might add extra space based on the whitespace between tags in the source html.
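One way to reduce that parser dependence is to turn every <br> into an explicit newline before extracting the text. The following is only a minimal sketch on a made-up fragment: sample_html and the br_to_newlines helper are hypothetical names, not part of the page or of BeautifulSoup.

```python
from bs4 import BeautifulSoup

def br_to_newlines(cell):
    """Replace each <br> inside `cell` with a literal newline and return the text."""
    for br in cell.find_all("br"):
        br.replace_with("\n")
    return cell.get_text()

# Hypothetical stand-in for the roster cell: one <br> ends a line, two end a block
sample_html = "<td><font>A</font><br><font>B</font><br><br><font>C</font></td>"
cell = BeautifulSoup(sample_html, "html.parser").td

text = br_to_newlines(cell)
# A blank line (two consecutive newlines) now reliably separates the blocks
sections = [s.strip() for s in text.split("\n\n") if s.strip()]
```

On the real page the source also contains whitespace between tags, so the text would probably need extra stripping before the split.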


    It's easier to split if you loop through the <br>s and pad them with something more distinctive to split by; the .insert... methods are useful here. (This method also has the advantage of being able to target bolded lines as well.)

    blockSoup = soup.select_one('td:has(font)')
    for br2 in blockSoup.select('br+br, font:has(br)'): 
        br2.insert_after(BeautifulSoup(f'<p>{"="*80}</p>', 'html.parser').p)
        br2.insert_before(BeautifulSoup(f'<p>{"="*80}</p>', 'html.parser').p)
    blockSections = [
        sect.strip().strip('-').strip() for sect in 
        blockSoup.get_text(' ').split("="*80) if sect.strip()
    ]
    

    This time, blockSections looks something like

    ['Membership Roster - ACE\n AIDS CLINICAL STUDIES AND EPIDEMIOLOGY STUDY SECTION\n Center For Scientific Review\n (Terms end 6/30 of the designated year)\n ROSTER',
     'CHAIRPERSON',
     'SCHACKER, TIMOTHY\n W\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF MINNESOTA\n MINNEAPOLIS,\n MN\n 55455',
     'MEMBERS',
     'ANDERSON, JEAN\n R\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF GYNECOLOGY AND OBSTETRICS\n JOHNS HOPKINS UNIVERSITY\n BALTIMORE,\n MD 21287',
     'BALASUBRAMANYAM, ASHOK\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE AND\n MOLECULAR AND CELLULAR BIOLOGY\n DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM\n BAYLOR COLLEGE OF MEDICINE\n HOUSTON,\n TX 77030',
     'BLATTNER, WILLIAM\n ALBERT\n , MD,\n (15)\n PROFESSOR AND ASSOCIATE DIRECTOR\n DEPARTMENT OF MEDICNE\n INSTITUTE OF HUMAN VIROLOGY\n UNIVERSITY OF MARYLAND, BALTIMORE\n BALTIMORE,\n MD 21201',
     'CHEN, YING\n QING\n , PHD,\n (15)\n PROFESSOR\n PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS\n FRED HUTCHINSON CANCER RESEARCH CENTER\n SEATTLE,\n WA 981091024',
     'COTTON, DEBORAH\n , MD,\n (13)\n PROFESSOR\n SECTION OF INFECTIOUS DISEASES\n DEPARTMENT OF MEDICINE\n BOSTON UNIVERSITY\n BOSTON,\n MA 02118',
     'DANIELS, MICHAEL\n J\n , SCD,\n (16)\n PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF TEXAS AT AUSTIN\n AUSTIN,\n TX 78712',
     'FOULKES, ANDREA\n SARAH\n , SCD,\n (14)\n ASSOCIATE PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF MASSACHUSETTS\n AMHERST,\n MA 01003',
     'HEROLD, BETSY\n C\n , MD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n ALBERT EINSTEIN COLLEGE OF MEDICINE\n BRONX,\n NY 10461',
     'JUSTICE, AMY\n CAROLINE\n , MD, PHD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n YALE UNIVERSITY\n NEW HAVEN,\n CT 06520',
     'KATZENSTEIN, DAVID\n ALLENBERG\n , MD,\n (13)\n PROFESSOR\n DIVISION OF INFECTIOUS DISEASES\n STANFORD UNIVERSITY SCHOOL OF MEDICINE\n STANFORD,\n CA 94305',
     'MARGOLIS, DAVID\n M\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL\n CHAPEL HILL,\n NC 27599',
     'MONTANER, LUIS\n J\n , DVM, PHD,\n (13)\n PROFESSOR\n DEPARTMENT OF IMMUNOLOGY\n THE WISTAR INSTITUTE\n PHILADELPHIA,\n PA 19104',
     'MONTANO, MONTY\n A\n , PHD,\n (15)\n RESEARCH SCIENTIST\n DEPARTMENT OF IMMUNOLOGY AND\n INFECTIOUS DISEASES\n BOSTON UNIVERSITY\n BOSTON,\n MA 02115',
     'PAGE, KIMBERLY\n , PHD, MPH,\n (16)\n PROFESSOR\n DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH\n AND GLOBAL HEALTH SCIENCES\n DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n SAN FRANCISCO,\n CA 94105',
     'SHIKUMA, CECILIA\n M\n , MD,\n (15)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n HAWAII AIDS CLINICAL RESEARCH PROGRAM\n UNIVERSITY OF HAWAII\n HONOLULU,\n HI 96816',
     'WOOD, CHARLES\n , PHD,\n (13)\n PROFESSOR\n UNIVERSITY OF NEBRASKA\n LINCOLN,\n NE 68588']
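To see the padding idea in isolation, here is a small sketch on a made-up fragment. The names sample_html and MARKER are hypothetical, and unlike the snippet above it inserts a plain string after the second <br> of each pair rather than a <p> tag on both sides.

```python
from bs4 import BeautifulSoup

MARKER = "=" * 20  # hypothetical separator; any string absent from the page works

sample_html = (
    "<td><font>ANN</font><br><font>PROFESSOR</font><br><br>"
    "<font>BOB</font><br><font>SCIENTIST</font></td>"
)
cell = BeautifulSoup(sample_html, "html.parser").td

# 'br+br' matches a <br> whose immediately preceding sibling element is also a <br>
for second_br in cell.select("br+br"):
    second_br.insert_after(MARKER)

# Join text nodes with newlines, cut at the marker, keep one list of lines per block
sections = [
    part.strip().splitlines()
    for part in cell.get_text("\n").split(MARKER)
    if part.strip()
]
```

Here sections holds one list of lines per person, which is easier to index than one long string.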
    

    create a table with the columns NAME, TITLE, LOCATION

    There may be a more elegant solution, but I feel like the simplest way would be to just loop the siblings of the headers and keep count of consecutive brs.

    doubleBr = soup.select('br')[:2] # [ so the last person also gets added ]
    personsList = []
    for f in soup.select('td>font>font:has(b br)'):
        role, lCur,pCur,brCt = f.get_text(' ').strip('-').strip(), [],[],0 
        for lf in f.find_next_siblings(['font','br'])+doubleBr:
            brCt = brCt+1 if lf.name == 'br' else 0 
            if pCur and (brCt>1 or lf.b):
                pDets = {'role': role, 'name': '?'} # initialize
    
                if len(pCur)>1: pDets['title'] = pCur[1]
                pDets['name'], pCur = pCur[0], pCur[2:]
                
                dList = pCur[:-2] 
                pDets['departments'] = dList[0] if len(dList)==1 else dList
    
                if len(pCur)>1: pDets['institute'] = pCur[-2]
                if pCur: pDets['location'] = pCur[-1]
    
                personsList.append(pDets)      
                pCur, lCur, brCt = [], [], 0 # clear
            if lf.b: break # reached next section
            if lf.name == 'font': # [split and join to minimize whitespace]
                lCur.append(' '.join(lf.get_text(' ').split())) # add to line
            if brCt and lCur: pCur, lCur = pCur+[' '.join(lCur)], [] # newline 
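The counting logic may be easier to follow in a stripped-down form. This sketch assumes a simplified, made-up fragment in which each <font> holds exactly one line (the real page spreads a line over several <font> tags, which the loop above handles with lCur); group_lines_by_double_br, sample_html and the placeholder names are all hypothetical.

```python
from bs4 import BeautifulSoup

def group_lines_by_double_br(cell):
    """Group the direct <font> children of `cell`; two consecutive <br> end a record."""
    records, current, br_count = [], [], 0
    for child in cell.find_all(["font", "br"], recursive=False):
        if child.name == "br":
            br_count += 1
            if br_count == 2 and current:  # blank line: close the record
                records.append(current)
                current = []
        else:
            br_count = 0
            current.append(" ".join(child.get_text(" ").split()))
    if current:  # flush the last record, which has no trailing <br><br>
        records.append(current)
    return records

sample_html = (
    "<td><font>DOE, JANE</font><br><font>PROFESSOR</font><br>"
    "<font>SPRINGFIELD, XX 00000</font><br><br>"
    "<font>ROE, RICHARD</font><br><font>RESEARCH SCIENTIST</font></td>"
)
cell = BeautifulSoup(sample_html, "html.parser").td
records = group_lines_by_double_br(cell)

# First line is the name, second the title, last the location (when present)
rows = [
    {
        "name": r[0],
        "title": r[1] if len(r) > 1 else None,
        "location": r[-1] if len(r) > 2 else None,
    }
    for r in records
]
```

Flushing after the loop plays the same role as appending doubleBr in the code above: it makes sure the last person is not dropped.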
    

    Since personsList is a list of dictionaries, it can be tabulated as simply as pandas.DataFrame(personsList) to get a DataFrame that looks like:

    role | name | title | departments | institute | location
    --- | --- | --- | --- | --- | ---
    CHAIRPERSON | SCHACKER, TIMOTHY W , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF MINNESOTA | MINNEAPOLIS, MN 55455
    MEMBERS | ANDERSON, JEAN R , MD | PROFESSOR | DEPARTMENT OF GYNECOLOGY AND OBSTETRICS | JOHNS HOPKINS UNIVERSITY | BALTIMORE, MD 21287
    MEMBERS | BALASUBRAMANYAM, ASHOK , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE AND', 'MOLECULAR AND CELLULAR BIOLOGY', 'DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM'] | BAYLOR COLLEGE OF MEDICINE | HOUSTON, TX 77030
    MEMBERS | BLATTNER, WILLIAM ALBERT , MD | PROFESSOR AND ASSOCIATE DIRECTOR | ['DEPARTMENT OF MEDICNE', 'INSTITUTE OF HUMAN VIROLOGY'] | UNIVERSITY OF MARYLAND, BALTIMORE | BALTIMORE, MD 21201
    MEMBERS | CHEN, YING QING , PHD | PROFESSOR | PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS | FRED HUTCHINSON CANCER RESEARCH CENTER | SEATTLE, WA 981091024
    MEMBERS | COTTON, DEBORAH , MD | PROFESSOR | ['SECTION OF INFECTIOUS DISEASES', 'DEPARTMENT OF MEDICINE'] | BOSTON UNIVERSITY | BOSTON, MA 02118
    MEMBERS | DANIELS, MICHAEL J , SCD | PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF TEXAS AT AUSTIN | AUSTIN, TX 78712
    MEMBERS | FOULKES, ANDREA SARAH , SCD | ASSOCIATE PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF MASSACHUSETTS | AMHERST, MA 01003
    MEMBERS | HEROLD, BETSY C , MD | PROFESSOR | DEPARTMENT OF PEDIATRICS | ALBERT EINSTEIN COLLEGE OF MEDICINE | BRONX, NY 10461
    MEMBERS | JUSTICE, AMY CAROLINE , MD, PHD | PROFESSOR | DEPARTMENT OF PEDIATRICS | YALE UNIVERSITY | NEW HAVEN, CT 06520
    MEMBERS | KATZENSTEIN, DAVID ALLENBERG , MD | PROFESSOR | DIVISION OF INFECTIOUS DISEASES | STANFORD UNIVERSITY SCHOOL OF MEDICINE | STANFORD, CA 94305
    MEMBERS | MARGOLIS, DAVID M , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL | CHAPEL HILL, NC 27599
    MEMBERS | MONTANER, LUIS J , DVM, PHD | PROFESSOR | DEPARTMENT OF IMMUNOLOGY | THE WISTAR INSTITUTE | PHILADELPHIA, PA 19104
    MEMBERS | MONTANO, MONTY A , PHD | RESEARCH SCIENTIST | ['DEPARTMENT OF IMMUNOLOGY AND', 'INFECTIOUS DISEASES'] | BOSTON UNIVERSITY | BOSTON, MA 02115
    MEMBERS | PAGE, KIMBERLY , PHD, MPH | PROFESSOR | ['DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH', 'AND GLOBAL HEALTH SCIENCES', 'DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS', 'UNIVERSITY OF CALIFORNIA, SAN FRANCISCO'] | UNIVERSITY OF CALIFORNIA, SAN FRANCISCO | SAN FRANCISCO, CA 94105
    MEMBERS | SHIKUMA, CECILIA M , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE', 'HAWAII AIDS CLINICAL RESEARCH PROGRAM'] | UNIVERSITY OF HAWAII | HONOLULU, HI 96816
    MEMBERS | WOOD, CHARLES , PHD | PROFESSOR | [] | UNIVERSITY OF NEBRASKA | LINCOLN, NE 68588
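Note that the departments column mixes plain strings, lists and an empty list. If a uniform column is preferred, one option is to flatten it before tabulating. This is only a sketch: flatten_departments and its sep parameter are hypothetical, and sample_persons reuses three rows from the table, trimmed to the relevant keys.

```python
def flatten_departments(person, sep="; "):
    """Return a copy of `person` whose 'departments' value is always a string."""
    row = dict(person)  # shallow copy so the original dictionary is left untouched
    depts = row.get("departments", [])
    if isinstance(depts, (list, tuple)):
        depts = sep.join(depts)  # an empty list becomes an empty string
    row["departments"] = depts
    return row

sample_persons = [
    {"name": "DANIELS, MICHAEL J , SCD", "departments": "DEPARTMENT OF BIOSTATISTICS"},
    {"name": "COTTON, DEBORAH , MD",
     "departments": ["SECTION OF INFECTIOUS DISEASES", "DEPARTMENT OF MEDICINE"]},
    {"name": "WOOD, CHARLES , PHD", "departments": []},
]
rows = [flatten_departments(p) for p in sample_persons]
```

Passing rows instead of personsList to pandas.DataFrame would then give a departments column that holds only strings.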

    [ Btw, if the .select('br+br, font:has(br)') and .select('td>font>font:has(b br)') parts are unfamiliar to you, you can look up .select and CSS selectors. Combinators [like >/+/,] and pseudo-classes [like :has] allow us to get very specific with our targets. ]
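For a quick feel of what those selectors match, here is a sketch on a made-up fragment that only loosely imitates the nesting of the roster page; sample_html and the placeholder names are hypothetical.

```python
from bs4 import BeautifulSoup

sample_html = (
    "<td><font><font><b>MEMBERS<br></b></font>"
    "<font>DOE, JANE</font><br><br><font>ROE, RICHARD</font></font></td>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# '>' is the child combinator; ':has(b br)' keeps a <font> only if it
# contains a <b> that itself contains a <br>
headers = soup.select("td>font>font:has(b br)")

# '+' is the adjacent-sibling combinator: a <br> directly preceded by another <br>
second_brs = soup.select("br+br")

# ',' builds a selector list: elements matching either selector are returned
either = soup.select("br+br, font:has(b)")
```

In this fragment headers contains just the MEMBERS <font>, second_brs the second <br> of the pair, and either the two <font> tags with a <b> inside them plus that same <br>.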