python pandas web-scraping dataset wikipedia

Pandas not printing all of the table

This is my first post so I hope I don't forget anything.

I was trying to scrape all of the UFC events to look at certain fighter stats, and I tried using pandas.

This is where my problem started. I loaded the page using:


import pandas as pd

tables = pd.read_html('https://en.wikipedia.org/wiki/UFC_168')

print(tables[2])

With this, I get the following output:

Main card ...
Weight class Unnamed: 1_level_1 ... Time Notes
0 Middleweight Chris Weidman (c) ... 1:16 [a]
1 Women's Bantamweight Ronda Rousey (c) ... 0:58 [b]
2 Heavyweight Travis Browne ... 1:00 NaN
3 Lightweight Jim Miller ... 3:42 NaN
4 Catchweight (151.5 lb) Dustin Poirier ... 4:54 NaN
5 Preliminary card (Fox Sports 1) Preliminary card (Fox Sports 1) ... Preliminary card (Fox Sports 1) Preliminary card (Fox Sports 1)
6 Middleweight Uriah Hall ... 5:00 [c]
7 Lightweight Michael Johnson ... 1:32 [d]
8 Featherweight Dennis Siver ... 5:00 [e]
9 Welterweight John Howard ... 5:00 NaN
10 Preliminary card (Online) Preliminary card (Online) ... Preliminary card (Online) Preliminary card (Online)
11 Welterweight William Macário ... 5:00 NaN
12 Featherweight Robbie Peralta ... 0:12 NaN

This output is missing three columns that are key to my research: the opponent, the method of finish, and the round in which the fight finished. If you have any idea how or why these columns are missing, please let me know. Thanks.

Solution

Pandas web scraping is not as robust as BeautifulSoup (the best open-source scraping library, in my opinion). BeautifulSoup also gives you more control over each variable and piece of structured data that you extract. I would therefore approach your problem with the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/UFC_168'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the request failed

# Parse HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table containing fight data
table = soup.find('table', {'class': 'toccolours'})

# Iterate through the rows in the table
for row in table.find_all('tr')[1:]: # Skip the header row
    columns = row.find_all('td')
    
    # Check that the row has all seven fight-data cells accessed below
    if len(columns) >= 7:
        weight_class = columns[0].get_text(strip=True)
        fighter = columns[1].get_text(strip=True)
        rel = columns[2].get_text(strip=True)
        opponent = columns[3].get_text(strip=True)
        method = columns[4].get_text(strip=True)
        round_finished = columns[5].get_text(strip=True)
        time = columns[6].get_text(strip=True)
        
        print(f'{weight_class} | {fighter} {rel} {opponent} | {method} | {round_finished} | {time}')

This gives you access to the key variables of interest (including opponent, method of finish, and round finished) and produces the following output:

Middleweight | Chris Weidman(c) def. Anderson Silva | TKO (leg injury) | 2 | 1:16
Women's Bantamweight | Ronda Rousey(c) def. Miesha Tate | Submission (armbar) | 3 | 0:58
Heavyweight | Travis Browne def. Josh Barnett | KO (elbows) | 1 | 1:00
Lightweight | Jim Miller def. Fabrício Camões | Submission (armbar) | 1 | 3:42
Catchweight (151.5 lb) | Dustin Poirier def. Diego Brandão | KO (punches) | 1 | 4:54
Middleweight | Uriah Hall def. Chris Leben | TKO (retirement) | 1 | 5:00
Lightweight | Michael Johnson def. Gleison Tibau | KO (punches) | 2 | 1:32
Featherweight | Dennis Siver vs. Manny Gamburyan | No Contest (overturned) | 3 | 5:00
Welterweight | John Howard def. Siyar Bahadurzada | Decision (unanimous) (30–27, 30–27, 30–27) | 3 | 5:00
Welterweight | William Macário def. Bobby Voelker | Decision (unanimous) (30–27, 30–27, 30–27) | 3 | 5:00
Featherweight | Robbie Peralta def. Estevan Payan | KO (punches) | 3 | 0:12
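
If the rows are needed for further analysis rather than just printed, the same loop can collect them into a list of dictionaries. The sketch below is one possible way to do that, not part of the original answer. It runs against a small hand-written HTML snippet (a hypothetical, heavily simplified stand-in for the Wikipedia table) so that it works offline, and the names SAMPLE_HTML, FIELDS, and parse_fights are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the results table on the Wikipedia page.
SAMPLE_HTML = """
<table class="toccolours">
  <tr><th colspan="8">Main card</th></tr>
  <tr><th>Weight class</th><th></th><th></th><th></th>
      <th>Method</th><th>Round</th><th>Time</th><th>Notes</th></tr>
  <tr><td>Middleweight</td><td>Chris Weidman (c)</td><td>def.</td>
      <td>Anderson Silva</td><td>TKO (leg injury)</td><td>2</td>
      <td>1:16</td><td></td></tr>
  <tr><td>Heavyweight</td><td>Travis Browne</td><td>def.</td>
      <td>Josh Barnett</td><td>KO (elbows)</td><td>1</td>
      <td>1:00</td><td></td></tr>
</table>
"""

# Illustrative field names, in the same order as the table cells.
FIELDS = ["weight_class", "fighter", "result", "opponent", "method", "round", "time"]


def parse_fights(html):
    """Return one dict per fight row found in the 'toccolours' table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "toccolours"})
    if table is None:
        return []  # Table not found, e.g. the page layout changed

    fights = []
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Header and section rows contain no <td> cells, so they are skipped here
        if len(cells) >= len(FIELDS):
            fights.append(dict(zip(FIELDS, cells)))
    return fights


fights = parse_fights(SAMPLE_HTML)
for fight in fights:
    print(fight)
```

To use this on the live page, response.content from the code above could be passed to parse_fights in place of SAMPLE_HTML, assuming the table still carries the toccolours class.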

Note that the two Python packages Requests and BeautifulSoup must be installed for the code above to work. For convenience:

pip install -U requests beautifulsoup4
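
Separately, the "..." in the DataFrame printed in the question suggests that pandas may only be eliding columns from the display, not dropping them from the data. Below is a minimal sketch of how to check this, assuming a small hand-made DataFrame (a hypothetical sample standing in for tables[2], with made-up flat column names) so that it runs without network access.

```python
import pandas as pd

# Hypothetical sample standing in for tables[2]; the real frame comes from pd.read_html.
sample = pd.DataFrame(
    [
        ["Middleweight", "Chris Weidman (c)", "def.", "Anderson Silva",
         "TKO (leg injury)", 2, "1:16", "[a]"],
        ["Heavyweight", "Travis Browne", "def.", "Josh Barnett",
         "KO (elbows)", 1, "1:00", None],
    ],
    columns=["Weight class", "Fighter", "Result", "Opponent",
             "Method", "Round", "Time", "Notes"],
)

# The column labels show what the frame actually holds, whatever print() displays.
print(list(sample.columns))

# to_string() renders every column instead of abbreviating the middle ones.
full_text = sample.to_string()
print(full_text)

# Alternatively, lift the column limit and widen the display for a single block.
with pd.option_context("display.max_columns", None, "display.width", 200):
    wide_text = repr(sample)
    print(wide_text)
```

If the columns do turn out to be present, pd.set_option("display.max_columns", None) would apply the same setting for the whole session.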