How to read a webpage table using requests-html?


I am new to Python and am trying to parse a table from the website below into a pandas DataFrame.

I am using the requests-html, requests, and BeautifulSoup modules.

Here is the website I would like to gather the table from: https://www.aamc.org/data-reports/workforce/interactive-data/active-physicians-largest-specialties-2019

MWE

import pandas as pd
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = 'https://www.aamc.org/data-reports/workforce/interactive-data/active-physicians-largest-specialties-2019'

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req).read()

soup = BeautifulSoup(page, 'html.parser')

# soup.find_all('table')
pages = soup.find('div', {'class': 'data-table-wrapper'})
df = pd.read_html(pages) # PROBLEM: somehow this table has no data
df.head()

Another attempt:

import requests_html

sess = requests_html.HTMLSession()
res = sess.get(url)
page_html = res.html

df = pd.read_html(page_html.raw_html)
df  # This gives a DataFrame, but it has no values



Solution

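  • The data you see on the page is not in the static HTML table. It is embedded inside a <script> tag as JavaScript, which is why pd.read_html returns a table with no values. You can either render the page with Selenium or parse the data out of the page source manually. Here I use the js2py module to decode the data: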

    import re
    import js2py
    import requests
    import pandas as pd
    
    
    url = "https://www.aamc.org/data-reports/workforce/interactive-data/active-physicians-largest-specialties-2019"
    html_doc = requests.get(url).text
    
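    # Capture the JavaScript value assigned to $scope.schools inside the <script> tag
    data = re.search(r"(?s)\$scope\.schools = (.*?);", html_doc).group(1)
    # Evaluate it with js2py and strip stray whitespace from every value
    data = [{k: v.strip() for k, v in d.items()} for d in js2py.eval_js(data)]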
    
    columns = {
        "specialty": "Specialty",
        "one": "Total Active Physicians",
        "two": "Patient Care",
        "three": "Teaching",
        "four": "Research",
        "five": "Other",
    }
    
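    # Map the page's generic keys to readable column names
    df = pd.DataFrame(data).rename(columns=columns)
    # to_markdown() requires the tabulate package
    print(df[list(columns.values())].to_markdown(index=False))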
    

    Prints:

    | Specialty | Total Active Physicians | Patient Care | Teaching | Research | Other |
    |---|---|---|---|---|---|
    | All Specialties | 938,980 | 816,922 | 12,475 | 12,632 | 96,951 |
    | Allergy and Immunology | 4,900 | 4,221 | 54 | 268 | 357 |
    | Anatomic/Clinical Pathology | 12,643 | 8,711 | 385 | 520 | 3,027 |
    | Anesthesiology | 42,267 | 39,377 | 540 | 180 | 2,170 |
    | Cardiovascular Disease | 22,521 | 20,430 | 299 | 573 | 1,219 |
    | Child and Adolescent Psychiatry | 9,787 | 8,670 | 134 | 109 | 874 |
    | Critical Care Medicine | 13,093 | 11,146 | 178 | 111 | 1,658 |
    | Dermatology | 12,516 | 11,747 | 100 | 98 | 571 |
    | Emergency Medicine | 45,202 | 41,466 | 469 | 94 | 3,173 |
    | Endocrinology, Diabetes, and Metabolism | 7,994 | 6,439 | 155 | 533 | 867 |
    | Family Medicine/General Practice | 118,198 | 108,984 | 1,614 | 251 | 7,349 |
    | Gastroenterology | 15,469 | 14,007 | 186 | 289 | 987 |
    | General Surgery | 25,564 | 21,949 | 259 | 137 | 3,219 |
    | Geriatric Medicine | 5,974 | 5,029 | 105 | 106 | 734 |
    | Hematology and Oncology | 16,274 | 13,506 | 250 | 871 | 1,647 |
    | Infectious Disease | 9,687 | 7,448 | 287 | 701 | 1,251 |
    | Internal Medicine | 120,171 | 105,736 | 1,409 | 1,447 | 11,579 |
    | Internal Medicine/Pediatrics | 5,509 | 4,924 | 74 | 28 | 483 |
    | Interventional Cardiology | 4,407 | 3,956 | 22 | 6 | 423 |
    | Neonatal-Perinatal Medicine | 5,919 | 5,008 | 135 | 175 | 601 |
    | Nephrology | 11,407 | 9,964 | 140 | 316 | 987 |
    | Neurological Surgery | 5,748 | 5,246 | 52 | 32 | 418 |
    | Neurology | 14,146 | 11,896 | 245 | 629 | 1,376 |
    | Neuroradiology | 4,089 | 3,496 | 63 | 7 | 523 |
    | Obstetrics and Gynecology | 42,720 | 39,825 | 499 | 195 | 2,201 |
    | Ophthalmology | 19,312 | 17,859 | 147 | 126 | 1,180 |
    | Orthopedic Surgery | 19,069 | 18,097 | 120 | 57 | 795 |
    | Otolaryngology | 9,777 | 9,140 | 90 | 23 | 524 |
    | Pain Medicine and Pain Management | 5,871 | 5,459 | 38 | 9 | 365 |
    | Pediatric Anesthesiology (Anesthesiology) | 2,571 | 2,127 | 47 | 4 | 393 |
    | Pediatric Cardiology | 2,966 | 2,414 | 74 | 64 | 414 |
    | Pediatric Critical Care Medicine | 2,639 | 2,118 | 78 | 20 | 423 |
    | Pediatric Hematology/Oncology | 3,079 | 2,251 | 77 | 210 | 541 |
    | Pediatrics | 60,618 | 54,764 | 844 | 663 | 4,347 |
    | Physical Medicine and Rehabilitation | 9,767 | 8,920 | 69 | 38 | 740 |
    | Plastic Surgery | 7,317 | 6,938 | 55 | 20 | 304 |
    | Preventive Medicine | 6,675 | 4,218 | 146 | 457 | 1,854 |
    | Psychiatry | 38,792 | 33,776 | 562 | 735 | 3,719 |
    | Pulmonary Disease | 5,106 | 4,490 | 138 | 296 | 182 |
    | Radiation Oncology | 5,306 | 4,854 | 56 | 33 | 363 |
    | Radiology and Diagnostic Radiology | 28,025 | 24,748 | 423 | 153 | 2,701 |
    | Rheumatology | 6,265 | 5,333 | 108 | 255 | 569 |
    | Sports Medicine | 2,897 | 2,624 | 20 | 4 | 249 |
    | Sports Medicine (Orthopedic Surgery) | 2,903 | 2,737 | 9 | | 157 |
    | Thoracic Surgery | 4,479 | 4,105 | 45 | 40 | 289 |
    | Urology | 10,201 | 9,593 | 76 | 39 | 493 |
    | Vascular and Interventional Radiology | 3,877 | 3,425 | 27 | 3 | 422 |
    | Vascular Surgery | 3,943 | 3,586 | 48 | 13 | 296 |
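
The values above are still strings with thousands separators, and at least one cell is blank, so they may need converting before any arithmetic. The following is only a rough sketch of how the extraction and a numeric conversion could be tried offline. `SAMPLE_HTML`, its made-up values, and the helper names `extract_records` and `to_numeric_frame` are hypothetical and not part of the original answer. It also swaps js2py for `json.loads`, which only works under the assumption that the embedded literal happens to be valid JSON (double-quoted keys, no trailing commas); if the real page does not meet that assumption, keep js2py.

```python
import json
import re

import pandas as pd

# Hypothetical stand-in for the downloaded page source (values are made up).
SAMPLE_HTML = """
<script>
  $scope.schools = [
    {"specialty": " Example A ", "one": "1,234", "two": "1,000"},
    {"specialty": "Example B", "one": "56", "two": ""}
  ];
</script>
"""


def extract_records(html):
    """Return the list assigned to $scope.schools, assuming it is valid JSON."""
    match = re.search(r"(?s)\$scope\.schools = (.*?);", html)
    if match is None:
        raise ValueError("$scope.schools assignment not found in page source")
    return json.loads(match.group(1))


def to_numeric_frame(records, label_column="specialty"):
    """Build a DataFrame and turn '1,234'-style strings into nullable integers."""
    df = pd.DataFrame([{k: v.strip() for k, v in r.items()} for r in records])
    for col in df.columns.drop(label_column):
        cleaned = df[col].str.replace(",", "", regex=False)
        # Blank cells become missing values instead of raising an error
        df[col] = pd.to_numeric(cleaned, errors="coerce").astype("Int64")
    return df


df = to_numeric_frame(extract_records(SAMPLE_HTML))
print(df)
```

With the real page, `html_doc` from the answer would take the place of `SAMPLE_HTML`, and the `rename(columns=columns)` step can be applied afterwards as before.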