Tags: python, web-scraping, beautifulsoup

Scrape HTML Table in Python


I am trying to scrape the SEC report page to pull some basic info on a number of tickers.

Here is an example URL for Apple - https://sec.report/CIK/0000320193

The page contains a 'Company Details' table with basic info. I essentially just want to scrape the IRS Number, State of Incorporation and Business Address.

I am fine with scraping the whole table and saving it into a pandas DataFrame. I am very new to web scraping, so I am looking for some tips to make this work. Below is my code, but I don't know where to go once I extract the panel body. Thanks!

import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
# Send the request through the session created above
page = session.get('https://sec.report/CIK/0000051143.html', headers=headers)

soup = BeautifulSoup(page.content, 'html.parser')

# Returns a list of every element with class "panel-body"
soup.find_all(class_='panel-body')

Solution

  • Instead of BeautifulSoup, try the lxml package; I find it easier to locate elements with XPath expressions:

    import requests
    from lxml import html

    session = requests.Session()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
    # Send the request through the session created above
    page = session.get('https://sec.report/CIK/0000051143', headers=headers)

    # Parse the response body into an element tree
    raw_html = html.fromstring(page.text)

    # Each XPath selects the row whose cell contains the label, then reads
    # the text of the second cell. [0] takes the first text node and raises
    # IndexError if no row matches.
    irs = raw_html.xpath('//tr[./td[contains(text(),"IRS Number")]]/td[2]/text()')[0]

    state_incorp = raw_html.xpath('//tr[./td[contains(text(),"State of Incorporation")]]/td[2]/text()')[0]

    address = raw_html.xpath('//tr[./td[contains(text(),"Business Address")]]/td[2]/text()')[0]
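If you would rather stay with BeautifulSoup and end up with a pandas DataFrame, as the question asks, here is a minimal sketch of one way to continue from the `panel-body` step. It assumes the 'Company Details' table is made of two-cell rows (label, value) inside an element with class `panel-body`; I have not checked that against the live page. `SAMPLE_HTML`, its placeholder values, and the helper name `table_to_dict` are made up for illustration, so the example runs without a network request.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup standing in for the real page (an assumption, not
# copied from sec.report); the values are placeholders.
SAMPLE_HTML = """
<div class="panel-body">
  <table>
    <tr><td>IRS Number</td><td>000000000</td></tr>
    <tr><td>State of Incorporation</td><td>NY</td></tr>
    <tr><td>Business Address</td><td>1 EXAMPLE ROAD<br>SOMETOWN NY 00000</td></tr>
  </table>
</div>
"""


def table_to_dict(html_text):
    """Map the first cell of each row to the text of the second cell."""
    soup = BeautifulSoup(html_text, "html.parser")
    details = {}
    for panel in soup.find_all(class_="panel-body"):
        for row in panel.find_all("tr"):
            cells = row.find_all("td")
            if len(cells) >= 2:
                label = cells[0].get_text(strip=True)
                # Join text split by <br> tags with a single space
                details[label] = cells[1].get_text(" ", strip=True)
    return details


details = table_to_dict(SAMPLE_HTML)

# Keep only the fields of interest; a missing label becomes None
wanted = ["IRS Number", "State of Incorporation", "Business Address"]
df = pd.DataFrame([{key: details.get(key) for key in wanted}])
print(df)
```

To use it on the real page, pass `page.text` from the request above instead of `SAMPLE_HTML`. Building one dict per ticker and passing the list of dicts to `pd.DataFrame` gives one row per company.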