python html web-scraping beautifulsoup

Web scraping SEC filings


I am working on scraping 10-Q documents from SEC EDGAR.

This is the URL: https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm

If you inspect the page, you can find the element in question: [screenshot]

I need to extract "1600 Amphitheatre Parkway" without using the id. Below is a code snippet that extracts the text using the id attribute. However, I need to use the name attribute.

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
page = session.get('https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm')
soup = BeautifulSoup(page.content, 'html.parser')

content = soup.find(id="d92517213e644-wk-Fact-0B11263160365DBABCF89969352EE602")
print(content.text)

Instead of the id attribute, I would like to use the name attribute. However, I am not able to extract the information using the name attribute. Please help.

See the HTML for the element: [screenshot]

How can I use the name attribute instead of the id attribute to extract the contents?

Thanks


Solution

  • You can find elements by attribute value like this:

    soup.find('html_tag', {"attribute": "value"})

    In your case, the name attribute is on the ix:nonnumeric tag, so:

    content = soup.find('ix:nonnumeric', {"name": "dei:EntityAddressAddressLine1"})
    print(content.text)
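
To try the lookup without hitting sec.gov, here is a minimal sketch that runs the same find() call against a small inline snippet. The SAMPLE_HTML string is a hypothetical, simplified stand-in for the filing's markup (the ids and contextRef values are made up); only the tag name and the name attribute value come from the question. Note that the attribute has to be passed in a dict, because find(name=...) is interpreted as the tag name rather than the HTML name attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the inline XBRL markup in the filing.
# The ids and contextRef values below are invented for illustration.
SAMPLE_HTML = """
<div>
  <ix:nonNumeric id="fact-1" name="dei:EntityAddressAddressLine1" contextRef="c1">1600 Amphitheatre Parkway</ix:nonNumeric>
  <ix:nonNumeric id="fact-2" name="dei:EntityAddressCityOrTown" contextRef="c1">Mountain View</ix:nonNumeric>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# html.parser lowercases tag names, so the search uses 'ix:nonnumeric'.
# The name attribute goes in a dict: find(name=...) would mean the tag name.
content = soup.find("ix:nonnumeric", {"name": "dei:EntityAddressAddressLine1"})

# Same lookup without specifying the tag, matching on the attribute alone.
by_attr_only = soup.find(attrs={"name": "dei:EntityAddressAddressLine1"})

# find() returns None when nothing matches, so guard before reading the text.
address = content.get_text(strip=True) if content is not None else None
print(address)
```

Against the real page, the same call should work on the soup object built in the question, provided the page is parsed with html.parser as shown there.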