Tags: pandas, web-scraping, python-requests-html

pd.read_html is missing the first table


The first table isn't coming through when I scrape this election website:

url = https://electproject.github.io/Early-Vote-2020G/GA_RO.html

Here is my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    "accept": "application/json, text/javascript, */*; q=0.01",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.99 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}

url = r'https://electproject.github.io/Early-Vote-2020G/GA_RO.html'
r = requests.get(url, headers=headers).text
soup = BeautifulSoup(r, 'html.parser')

I tried both of the following, but still didn't get the first table with counties/votes/turnout rates:

tables = soup.find_all('table')
dfs = []
for table in tables:
    df = pd.read_html(str(table))[0]
    dfs.append(df)

Other attempt:

df = pd.read_html(r, flavor='html5lib')

Both pull all the other tables but not the first. I assume it's because of the sortable headers, but I'm not sure.


Solution

  • The problem is that the first table is rendered with JavaScript, so there is no <table> element for it in the HTML that requests downloads.

    Instead, you can read the data directly from the <script> element that feeds the JavaScript (inspect the page source to find the right one):

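    import json
    
    # The table's data is stored as JSON in the <script> element tied to the widget
    script = soup.find('script', {
        'data-for': 'htmlwidget-21712dd45dd736e3c1b9',
    })
    data = script.string
    
    # The JSON holds one list per column, so transpose to get one row per county
    df = pd.DataFrame(json.loads(data)['x']['data']).T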
    

    Output:

                 0      1      2         3
    0      APPLING   4453  12240  0.363807
    1     ATKINSON   1431   4939  0.289735
    2        BACON   3111   7071  0.439966
    3        BAKER    872   2297  0.379626
    4      BALDWIN  11850  27567  0.429862
    ..         ...    ...    ...       ...
    154  WHITFIELD  16980  57014  0.297822
    155     WILCOX   1582   4838  0.326995
    156     WILKES   2958   7204  0.410605
    157  WILKINSON   2236   6761   0.33072
    158      WORTH   4473  14601  0.306349
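
The approach above hardcodes the widget id, which would stop matching if the id ever changes (an assumption; the id looks auto-generated). Below is a minimal sketch that instead picks up every <script> element carrying a data-for attribute and builds one DataFrame per element, assuming the JSON keeps the same x -> data layout. The names widget_frames and SAMPLE_HTML, the sample markup, and the column labels are hypothetical; the two sample rows are the first two rows of the output above. It runs on a small stand-in HTML string, so it does not need network access.

```python
import json

import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for the downloaded page source (hypothetical markup):
# one JSON <script> tied to a widget, plus one ordinary static <table>.
SAMPLE_HTML = """
<div id="htmlwidget-example"></div>
<script type="application/json" data-for="htmlwidget-example">
{"x": {"data": [["APPLING", "ATKINSON"],
                [4453, 1431],
                [12240, 4939],
                [0.363807, 0.289735]]}}
</script>
<table><tr><th>a</th></tr><tr><td>1</td></tr></table>
"""


def widget_frames(html, columns=None):
    """Return one DataFrame per <script data-for=...> JSON block in html."""
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    # A value of True matches any element that has the attribute at all.
    for script in soup.find_all("script", attrs={"data-for": True}):
        if not script.string:
            continue  # empty <script>, nothing to parse
        payload = json.loads(script.string)
        column_lists = payload.get("x", {}).get("data")
        if column_lists is None:
            continue  # this block does not follow the assumed layout
        names = columns if columns is not None else range(len(column_lists))
        if len(names) != len(column_lists):
            raise ValueError("number of column names does not match the data")
        # Building from a dict of columns keeps each column's own dtype,
        # so no transpose is needed.
        frames.append(pd.DataFrame(dict(zip(names, column_lists))))
    return frames


# Column labels are guesses based on the question, not taken from the page.
frames = widget_frames(
    SAMPLE_HTML,
    columns=["county", "votes", "registered_voters", "turnout_rate"],
)
print(frames[0])

# The static HTML still contains only the ordinary table.
static_tables = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("table")
print(len(static_tables), "static <table> element(s)")
```

On the real page you would pass the downloaded text (r in the question) instead of SAMPLE_HTML. If the page has several such blocks, you get several frames back and need to pick the right one.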