python web-scraping beautifulsoup html-parsing

Using BeautifulSoup to find an attribute called data-stat


I'm currently working on a web scraper that pulls stats for a football player. Usually this would be an easy task if I could just grab the divs; however, this website uses an attribute called data-stat and uses it like a class. Here is an example:

<th scope="row" class="left " data-stat="year_id"><a href="/years/2000/">2000</a></th>

If you would like to check the site for yourself, here is the link:

https://www.pro-football-reference.com/players/B/BradTo00.htm

I've tried a few different methods. Either it doesn't work at all, or I can start a for loop and begin putting things into arrays; however, you will notice that not everything in the table is the same data type.

Sorry for the formatting and the grammar.

Here is what I have so far. I'm sure it's not the best-looking code; it's mainly code I've tried on my own with a few things mixed in from searching on Google. Ignore the random imports; I was trying different things.

# import libraries
import csv
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import lxml.html as lh
import pandas as pd

# specify url
url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'

# request html
page = requests.get(url)

# Parse html using BeautifulSoup, you can use a different parser like lxml if present
soup = BeautifulSoup(page.content, 'lxml')
# find searches the given tag (div) with given class attribute and returns the first match it finds



headers = [c.get_text() for c in soup.find(class_ = 'table_container').find_all('td')[0:31]]

data = [[cell.get_text(strip=True) for cell in row.find_all('td')[0:32]]
        for row in soup.find_all("tr", class_=True)]

tags = soup.find(data ='pos')
#stats = tags.find_all('td')

print(tags)

Solution

  • Use the get method on a BeautifulSoup tag to read an attribute by name. See: BeautifulSoup Get Attribute

    Here is a snippet to get all the data you want from the table:

    from bs4 import BeautifulSoup
    import requests
    
    url = "https://www.pro-football-reference.com/players/B/BradTo00.htm"
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    
    # Get table
    table = soup.find(class_="table_outer_container")
    
    # Get head
    thead = table.find('thead')
    th_head = thead.find_all('th')
    
    for thh in th_head:
        # Get cell text
        print(thh.get_text())
    
        # Get data-stat value
        print(thh.get('data-stat'))
    
    # Get body
    tbody = table.find('tbody')
    tr_body = tbody.find_all('tr')
    
    for trb in tr_body:
        # Get id
        print(trb.get('id'))
    
        # Get th data
        th = trb.find('th')
        print(th.get_text())
        print(th.get('data-stat'))
    
        for td in trb.find_all('td'):
            # Get cell text
            print(td.get_text())
            # Get data-stat value
            print(td.get('data-stat'))
    
    # Get footer
    tfoot = table.find('tfoot')
    thf = tfoot.find('th')
    
    # Get cell text
    print(thf.get_text())
    # Get data-stat value
    print(thf.get('data-stat'))
    
    for tdf in tfoot.find_all('td'):
        # Get cell text
        print(tdf.get_text())
        # Get data-stat value
        print(tdf.get('data-stat'))
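
To pick out cells by their data-stat value directly, rather than walking every cell, the attribute can be passed through the attrs argument. The call soup.find(data='pos') from the question searches for an attribute literally named data, and a hyphenated name such as data-stat cannot be written as a Python keyword argument. The sketch below runs on a small made-up sample table (the ids and cell values are illustrative assumptions, not copied from the real page):

```python
from bs4 import BeautifulSoup

# Made-up sample rows modelled on the markup quoted in the question.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr id="row-1">
      <th scope="row" class="left " data-stat="year_id"><a href="/years/2000/">2000</a></th>
      <td data-stat="pos">QB</td>
      <td data-stat="g">1</td>
    </tr>
    <tr id="row-2">
      <th scope="row" class="left " data-stat="year_id"><a href="/years/2001/">2001</a></th>
      <td data-stat="pos">QB</td>
      <td data-stat="g">15</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Hyphenated attribute names are not valid keyword arguments,
# so they are passed in the attrs dictionary instead.
pos_cells = soup.find_all("td", attrs={"data-stat": "pos"})
positions = [cell.get_text(strip=True) for cell in pos_cells]

# The same kind of lookup written as a CSS attribute selector.
games = [cell.get_text(strip=True) for cell in soup.select('td[data-stat="g"]')]

print(positions)
print(games)
```

Both forms return Tag objects, so get_text() and get('data-stat') work on the results exactly as in the loops above.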
    

    You can of course save the data to a CSV or JSON file instead of printing it.
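
As a rough sketch of the CSV option, each body row can be collected into a dictionary keyed by its cells' data-stat values and then written out with csv.DictWriter. The helper names table_to_rows and rows_to_csv, the sample HTML, and the file name stats.csv mentioned in the comment are all assumptions made for illustration. Cell values are kept as strings, since the columns mix text and numbers.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up sample table; the real page has many more columns.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr>
      <th scope="row" data-stat="year_id"><a href="/years/2000/">2000</a></th>
      <td data-stat="pos">QB</td>
      <td data-stat="g">1</td>
    </tr>
    <tr>
      <th scope="row" data-stat="year_id"><a href="/years/2001/">2001</a></th>
      <td data-stat="pos">QB</td>
      <td data-stat="g">15</td>
    </tr>
  </tbody>
</table>
"""


def table_to_rows(table):
    """Return one dict per body row, keyed by each cell's data-stat value."""
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        row = {}
        for cell in tr.find_all(["th", "td"]):
            key = cell.get("data-stat")  # None when the attribute is absent
            if key is not None:
                row[key] = cell.get_text(strip=True)
        if row:  # skip rows that carry no data-stat cells
            rows.append(row)
    return rows


def rows_to_csv(rows, out):
    """Write the row dicts to a file-like object as CSV."""
    if not rows:
        return
    writer = csv.DictWriter(
        out,
        fieldnames=list(rows[0].keys()),
        restval="",             # fill columns missing from a row
        extrasaction="ignore",  # drop columns not seen in the first row
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = table_to_rows(soup.find("table"))

# StringIO keeps the example self-contained; to write a file instead,
# pass open("stats.csv", "w", newline="") here.
buffer = io.StringIO()
rows_to_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the rows are plain dictionaries, the same list could also be passed to json.dump for JSON output.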