This is my first post so I hope I don't forget anything.
I'm trying to scrape all of the UFC events to look at certain fighter stats, and I tried using pandas.
This is where my problem started. I read the page using
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/UFC_168')
print(tables[2])
Now with this, I get the output
Main card ...
Weight class Unnamed: 1_level_1 ... Time Notes
0 Middleweight Chris Weidman (c) ... 1:16 [a]
1 Women's Bantamweight Ronda Rousey (c) ... 0:58 [b]
2 Heavyweight Travis Browne ... 1:00 NaN
3 Lightweight Jim Miller ... 3:42 NaN
4 Catchweight (151.5 lb) Dustin Poirier ... 4:54 NaN
5 Preliminary card (Fox Sports 1) Preliminary card (Fox Sports 1) ... Preliminary card (Fox Sports 1) Preliminary card (Fox Sports 1)
6 Middleweight Uriah Hall ... 5:00 [c]
7 Lightweight Michael Johnson ... 1:32 [d]
8 Featherweight Dennis Siver ... 5:00 [e]
9 Welterweight John Howard ... 5:00 NaN
10 Preliminary card (Online) Preliminary card (Online) ... Preliminary card (Online) Preliminary card (Online)
11 Welterweight William Macário ... 5:00 NaN
12 Featherweight Robbie Peralta ... 0:12 NaN
This output is missing three key columns I need for my research: the opponent, the method of finish, and the round in which the fight finished. If you have any idea how or why these pieces are missing, please let me know. Thanks.
Pandas web scraping is not as robust as BeautifulSoup (the best open-source scraping library, imo). BeautifulSoup also gives you more control over each variable and piece of structured data that you extract. I would therefore approach your problem with the following code:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/UFC_168'
response = requests.get(url)
response.raise_for_status()  # Stop early if the request failed

# Parse HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table containing fight data
table = soup.find('table', {'class': 'toccolours'})

# Iterate through the rows in the table
for row in table.find_all('tr')[1:]:  # Skip the header row
    columns = row.find_all('td')
    # Check if the row contains fight data (indexes 0 to 6 are read below)
    if len(columns) >= 7:
        weight_class = columns[0].get_text(strip=True)
        fighter = columns[1].get_text(strip=True)
        rel = columns[2].get_text(strip=True)
        opponent = columns[3].get_text(strip=True)
        method = columns[4].get_text(strip=True)
        round_finished = columns[5].get_text(strip=True)
        time = columns[6].get_text(strip=True)
        print(f'{weight_class} | {fighter} {rel} {opponent} | {method} | {round_finished} | {time}')
This gives you access to the key variables of interest (including opponent, method of finish, and round finished) and produces the following output:
Middleweight | Chris Weidman(c) def. Anderson Silva | TKO (leg injury) | 2 | 1:16
Women's Bantamweight | Ronda Rousey(c) def. Miesha Tate | Submission (armbar) | 3 | 0:58
Heavyweight | Travis Browne def. Josh Barnett | KO (elbows) | 1 | 1:00
Lightweight | Jim Miller def. Fabrício Camões | Submission (armbar) | 1 | 3:42
Catchweight (151.5 lb) | Dustin Poirier def. Diego Brandão | KO (punches) | 1 | 4:54
Middleweight | Uriah Hall def. Chris Leben | TKO (retirement) | 1 | 5:00
Lightweight | Michael Johnson def. Gleison Tibau | KO (punches) | 2 | 1:32
Featherweight | Dennis Siver vs. Manny Gamburyan | No Contest (overturned) | 3 | 5:00
Welterweight | John Howard def. Siyar Bahadurzada | Decision (unanimous) (30–27, 30–27, 30–27) | 3 | 5:00
Welterweight | William Macário def. Bobby Voelker | Decision (unanimous) (30–27, 30–27, 30–27) | 3 | 5:00
Featherweight | Robbie Peralta def. Estevan Payan | KO (punches) | 3 | 0:12
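If you want to analyse the fights rather than just print them, the same loop can collect each row into a dictionary. This is only a minimal sketch, assuming the table layout described above. The inline `SAMPLE_HTML` snippet, the `FIELDS` list, and the `parse_fights` helper are made up for illustration so the example runs without a network call.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the Wikipedia results table
SAMPLE_HTML = """
<table class="toccolours">
  <tr><th colspan="8">Main card</th></tr>
  <tr><td>Heavyweight</td><td>Travis Browne</td><td>def.</td><td>Josh Barnett</td>
      <td>KO (elbows)</td><td>1</td><td>1:00</td><td></td></tr>
  <tr><th colspan="8">Preliminary card (Fox Sports 1)</th></tr>
  <tr><td>Middleweight</td><td>Uriah Hall</td><td>def.</td><td>Chris Leben</td>
      <td>TKO (retirement)</td><td>1</td><td>5:00</td><td>[c]</td></tr>
</table>
"""

# Illustrative field names for the first seven cells of a fight row
FIELDS = ["weight_class", "fighter", "result", "opponent", "method", "round", "time"]


def parse_fights(html):
    """Return one dict per fight row; header and section rows are skipped."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "toccolours"})
    if table is None:  # Table not found, e.g. the page layout changed
        return []
    fights = []
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Section headers use <th>, so they yield no <td> cells and are skipped
        if len(cells) >= len(FIELDS):
            fights.append(dict(zip(FIELDS, cells)))
    return fights


fights = parse_fights(SAMPLE_HTML)
for fight in fights:
    print(fight["fighter"], fight["result"], fight["opponent"], "|", fight["method"])
```

A list of dictionaries like this can be passed straight to `pd.DataFrame(fights)` if you still want to do the analysis in pandas.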
Note that the two Python packages Requests and BeautifulSoup must be installed for the code above to work. For convenience:
pip install -U requests beautifulsoup4
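One more thing worth checking on the pandas side: the `...` in the printed table is how pandas abbreviates a wide frame when printing, so the columns may well be present in `tables[2]` and only hidden in the display. A minimal sketch, assuming that is what happened here. The small frame below is made-up stand-in data with illustrative column names, not the scraped table.

```python
import pandas as pd

# Made-up stand-in for the scraped table (the real one comes from pd.read_html)
df = pd.DataFrame({
    "Weight class": ["Heavyweight"],
    "Fighter": ["Travis Browne"],
    "Result": ["def."],
    "Opponent": ["Josh Barnett"],
    "Method": ["KO (elbows)"],
    "Round": [1],
    "Time": ["1:00"],
})

# The column index lists every column, regardless of how the frame prints
column_names = list(df.columns)
print(column_names)

# Temporarily lift the display limits so no columns are elided when printing
with pd.option_context("display.max_columns", None, "display.width", 1000):
    full_text = repr(df)
    print(full_text)
```

If the missing columns show up this way, you can select them directly from the frame instead of re-scraping the page.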