How to scrape the pitcher's name and team?


I am new to scraping/coding and could use some help if possible.

  from bs4 import BeautifulSoup
  import requests
  import pandas as pd

  page_link ='https://www.baseball-reference.com/previews/index.shtml'
  page_response = requests.get(page_link, timeout=5)
  soup = BeautifulSoup(page_response.content, "html.parser")

I need help finding the appropriate way to extract the pitcher's name and team.

(examples only:)

  player_name = [i.text for i in soup.find_all('td', {'href': 'example-name'})]

  team = [i.text for i in soup.find_all('td', {'href': 'example-team'})]  

Here is where I export to Excel:

  my_dict = dict(zip(player_name, team))

  df = pd.DataFrame(pd.Series(my_dict))

  # The context manager saves and closes the file; writer.save() was removed in pandas 2.0
  with pd.ExcelWriter('pitching_webscrape.xlsx') as writer:
      df.to_excel(writer, sheet_name='Sheet1')

I would like the pitcher's name and team exported to Excel. Thanks in advance for your help! Please let me know if I can improve my question or add more details.

My first code:

  t = soup.find_all('td')
  print(t)  

My first output:

[Blue Jays (60-70) , , Preview , Orioles (37-94) , , 7:05PM , TOR, Sam Gaviglio
(#43, 28, RHP, 3-6, 4.94), BAL, David Hess
(#41, 24, RHP, 2-8, 5.50), White Sox (51-79) , ,

My second code:

  t = soup.find_all('td')
  for a in t:
      print(a.text)  

My second output:

Blue Jays (60-70)

Preview

Orioles (37-94)

7:05PM

TOR Sam Gaviglio(#43, 28, RHP, 3-6, 4.94) BAL David Hess(#41, 24, RHP, 2-8, 5.50) White Sox (51-79)

I am getting closer; however, I only want the players' names and the teams' names (e.g. TOR, Sam Gaviglio). I also want this exported to Excel. Thanks! =)


Solution

  • If you just want a single list of players and teams, then this should suffice:

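    import re
    players_and_teams = []

    # Walk every table cell and collect the text of each link inside it
    for i in soup.find_all('td'):
        for link in i.find_all('a'):
            # Skip the "Preview" links, which are neither a team nor a player
            if not re.findall(r'Preview', link.text):
                players_and_teams.append(link.text)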
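
If you would rather have each pitcher paired with his team before exporting, here is a rough sketch of one way to do it. It runs on a small hand-written HTML sample rather than the live page. The markup, the href values, and the names `SAMPLE_HTML`, `extract_pitchers` and `save_to_excel` are all assumptions made for illustration: the sample assumes that each pitcher row holds the team abbreviation in its first cell and a link whose href starts with `/players/`. Check the real page source and adjust the selectors if it is laid out differently.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: the structure and hrefs are illustrative assumptions,
# not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a href="/teams/TOR/2018.shtml">Blue Jays</a> (60-70)</td>
    <td><a href="/previews/example.shtml">Preview</a></td>
  </tr>
  <tr>
    <td><strong>TOR</strong></td>
    <td><a href="/players/g/example01.shtml">Sam Gaviglio</a> (#43, 28, RHP, 3-6, 4.94)</td>
  </tr>
  <tr>
    <td><strong>BAL</strong></td>
    <td><a href="/players/h/example02.shtml">David Hess</a> (#41, 24, RHP, 2-8, 5.50)</td>
  </tr>
</table>
"""


def extract_pitchers(soup):
    """Return (team, pitcher) tuples, one per row that contains a player link."""
    pairs = []
    for row in soup.find_all("tr"):
        # Assumption: player links are the ones whose href starts with /players/
        link = row.find("a", href=re.compile(r"^/players/"))
        if link is None:
            continue  # rows without a player link (team names, Preview) are skipped
        # Assumption: the first cell of a pitcher row holds the team abbreviation
        team_cell = row.find("td")
        pairs.append((team_cell.get_text(strip=True), link.get_text(strip=True)))
    return pairs


def save_to_excel(df, path):
    """Write the DataFrame to an .xlsx file (needs an Excel engine such as openpyxl)."""
    with pd.ExcelWriter(path) as writer:
        df.to_excel(writer, sheet_name="Sheet1", index=False)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
df = pd.DataFrame(extract_pitchers(soup), columns=["team", "pitcher"])
print(df)
```

To use it on the real page, build `soup` from `page_response.content` as in the question and then call `save_to_excel(df, 'pitching_webscrape.xlsx')`. Keeping the pairs as a list of tuples, instead of `dict(zip(...))`, also avoids losing rows when the same key appears more than once.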