python arrays beautifulsoup resultset

Converting BS4 resultset to NxN array relative to headers (separate BS4 resultset)


TL;DR: I need to turn a BS4 ResultSet (a flat, single-column list) into an NxN array. How can I do that, and how can I attach headers that are also a BS4 ResultSet? Code below. Thank you!

So I am attempting to web scrape sports data, but I'm having trouble converting the resultset into an NxN array. Additionally, I'm trying to include headers that were scraped in the same manner. Here's my code so far:

# __future__ imports must come before any other statement in the module
from __future__ import print_function
import requests
from bs4 import BeautifulSoup
import numpy as np

url=input("Paste player link and specific year ")
r= requests.get(url)
html_content=r.text
soup=BeautifulSoup(html_content,"lxml")

body = soup.body
table=body.table
tbody=table.tbody

headers = table.find_all("th")
statistics = tbody.find_all("td")

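def string_stats():
    # Print each cell and also return the strings; printing alone
    # would leave the function returning None.
    strings = [stat.string for stat in statistics]
    for s in strings:
        print(s)
    return strings

def string_headers():
    # Same idea for the header cells.
    strings = [head.string for head in headers]
    for s in strings:
        print(s)
    return strings

string_stats_list = string_stats()
string_stats_list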

This results in a vertical list of just the td tag elements as strings (or that was the goal).

So, my questions are: how can I reshape this single-column list into an NxN array/matrix, and how can I attach the headers?

Thanks for reading and/or the help!


Solution

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    url='http://www.footballdb.com/players/mike-evans-evansmi03/gamelogs'
    r= requests.get(url)
    html_content=r.content
    soup=BeautifulSoup(html_content,"lxml")
    
    body = soup.body
    table=body.table
    
    headers = table.find_all("th")
    
    headers_list = [i.text for i in headers]
    
    # Build one list of cell strings per table row, skipping the header row
    string_stats_list = []
    for tr in table.select('tr')[1:]:
        string_stats_list.append([td.text for td in tr.select('td')])
    
    df = pd.DataFrame(data=string_stats_list, columns=headers_list)
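
If you would rather keep the flat list of td strings from the question, the reshaping step can be sketched on its own in plain Python. This is only a minimal sketch: it assumes every row has exactly one td per header (which may not hold if the table uses colspan or extra header rows), the sample headers and cells are made up for illustration, and rows_from_flat is a hypothetical helper, not part of BeautifulSoup or pandas. Note that the result is N rows by M columns rather than strictly NxN.

```python
def rows_from_flat(cells, headers):
    """Split a flat list of cell strings into rows of len(headers) cells each."""
    width = len(headers)
    if width == 0 or len(cells) % width != 0:
        # Assumption: one cell per header in every row; fail loudly otherwise.
        raise ValueError("cell count is not a multiple of the header count")
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# Hypothetical sample data standing in for the scraped th/td strings.
headers = ["Date", "Opp", "Rec", "Yds"]
cells = ["09/10", "CHI", "7", "93", "09/17", "MIN", "5", "67"]

rows = rows_from_flat(cells, headers)

# Attach the headers by pairing each row with the header names.
records = [dict(zip(headers, row)) for row in rows]

print(rows)
print(records[0]["Yds"])
```

The rows list can be passed to pd.DataFrame(data=rows, columns=headers) just like string_stats_list above. Collecting the cells row by row, as in the solution, is usually the safer choice because it does not depend on every row having the same width.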