Tags: python, html, web-scraping, beautifulsoup

Extract HTML Table Based on Specific Column Headers - Python


I am trying to extract HTML tables from the URL shown in the code below.

For example, the 2019 Director Compensation Table on page 44. I believe the table doesn't have a specific id, such as 'Compensation Table'. The only approach I can think of is to match column names or keywords such as "Stock Awards" or "All Other Compensation" and then grab the associated table.

Is there an easy way to extract these tables based on column names? Or maybe an easier way?

Thanks!

I am relatively new to scraping HTML tables. My code is as follows:

from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm'
r = requests.get(url) 
soup = BeautifulSoup(r.text, 'html.parser')
rows = soup.find_all('tr')

Solution

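  • Yes, you can do that with the pandas read_html function, using its match and attrs parameters as described in the documentation. match keeps only tables that contain text matching the given string or regex, and attrs restricts the search to <table> tags with the given HTML attributes.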

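    import pandas as pd

    url = "https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm"

    # read_html returns a list of DataFrames, one per matching <table>
    tables = pd.read_html(
        url,
        # keep only <table> tags carrying this style attribute
        attrs={'style': 'border-collapse: collapse; width: 100%; font: 9pt Arial, Helvetica, Sans-Serif'},
        # and only tables containing this text
        match="Non-Employee Directors")

    print(tables)

    # write the first matching table to CSV without index or header row
    tables[0].to_csv("data.csv", index=False, header=False)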
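
If the style attribute is not known in advance, the keyword idea from the question can also be sketched with BeautifulSoup alone. This is only a minimal sketch, not a tested scraper for the filing: SAMPLE_HTML is made-up markup standing in for the downloaded page, and normalize, table_rows and find_tables_by_keywords are hypothetical helper names. It keeps every table whose text contains all of the given keywords, ignoring case and extra whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded filing (r.text)
SAMPLE_HTML = """
<table>
  <tr><td>Item</td><td>Page</td></tr>
  <tr><td>Proxy summary</td><td>1</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Stock Awards ($)</th><th>All Other
      Compensation ($)</th></tr>
  <tr><td>Director A</td><td>100</td><td>5</td></tr>
  <tr><td>Director B</td><td>200</td><td>0</td></tr>
</table>
"""


def normalize(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return " ".join(text.split())


def table_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [normalize(c.get_text(" ")) for c in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def find_tables_by_keywords(html, keywords):
    """Return the rows of every table whose text contains all keywords."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for table in soup.find_all("table"):
        text = normalize(table.get_text(" ")).lower()
        if all(k.lower() in text for k in keywords):
            matches.append(table_rows(table))
    return matches


found = find_tables_by_keywords(
    SAMPLE_HTML, ["Stock Awards", "All Other Compensation"]
)
for row in found[0]:
    print(row)
```

On a real page, pass r.text from the question's code instead of SAMPLE_HTML. Note that if the page nests tables inside layout tables, an outer table also contains the keywords of its inner tables, so the result may need further filtering.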
    
