python, list, dataframe, screen-scraping

Extracting a scraped list into new columns


I have this code (borrowed from an old question posted on this site):

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

# Load the page with Selenium and parse the page source
driver = webdriver.Chrome()
driver.get("https://www.baseball-reference.com/leagues/MLB/2013-finalyear.shtml")
doc = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()  # close the browser once the HTML has been captured

# The table has an id, which makes it simpler to target
batting = doc.find(id='misc_batting')

careers = []
for row in batting.find_all('tr')[1:]:
    dictionary = {}
    dictionary['names'] = row.find(attrs = {"data-stat": "player"}).text.strip()
    dictionary['experience'] = row.find(attrs={"data-stat": "experience"}).text.strip()
    careers.append(dictionary)

Which generates a result like this:

[{'names': 'David Adams', 'experience': '1'}, {'names': 'Steve Ames', 'experience': '1'}, {'names': 'Rick Ankiel', 'experience': '11'}, {'names': 'Jairo Asencio', 'experience': '4'}, {'names': 'Luis Ayala', 'experience': '9'}, {'names': 'Brandon Bantz', 'experience': '1'}, {'names': 'Scott Barnes', 'experience': '2'}, {'names':

How do I turn this into a DataFrame with one column per key, like this?

Names       Experience
David Adams   1

Solution

  • You can simplify this quite a bit with pandas. Have read_html pull the table by its id, then keep just the Name and Yrs columns.

    import pandas as pd
    
    url = "https://www.baseball-reference.com/leagues/MLB/2013-finalyear.shtml"
    df = pd.read_html(url, attrs = {'id': 'misc_batting'})[0]
    
    df_filter = df[['Name','Yrs']]
    

    If you need to rename those columns, add:

    df_filter = df_filter.rename(columns={'Name':'names','Yrs':'experience'})
    

    Output:

    print(df_filter)
                  names  experience
    0       David Adams           1
    1        Steve Ames           1
    2       Rick Ankiel          11
    3     Jairo Asencio           4
    4        Luis Ayala           9
    ..              ...         ...
    209    Dewayne Wise          11
    210       Ross Wolf           3
    211  Kevin Youkilis          10
    212   Michael Young          14
    213          Totals        1357
    
    [214 rows x 2 columns]
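
If you would rather keep the Selenium/BeautifulSoup loop from the question, the list of dictionaries can also be passed straight to the DataFrame constructor, which uses the dictionary keys as column names. The following is a minimal sketch, not a definitive implementation: the sample records are copied from the outputs shown above and stand in for the real `careers` list, and it assumes a summary row named Totals may be present and is not wanted.

```python
import pandas as pd

# Sample records copied from the outputs above; in practice this list
# is the `careers` list built by the scraping loop in the question.
careers = [
    {'names': 'David Adams', 'experience': '1'},
    {'names': 'Steve Ames', 'experience': '1'},
    {'names': 'Rick Ankiel', 'experience': '11'},
    {'names': 'Totals', 'experience': '1357'},
]

# Each dictionary becomes a row; the keys become the column names.
df = pd.DataFrame(careers)

# The scraped values are strings, so convert experience to numbers.
df['experience'] = pd.to_numeric(df['experience'])

# Assumption: the summary row is not a player, so drop it if present.
df = df[df['names'] != 'Totals'].reset_index(drop=True)

# Rename to match the column headings asked for in the question.
df = df.rename(columns={'names': 'Names', 'experience': 'Experience'})

print(df)
```

The same Totals filter should also work on df_filter from the read_html approach, since that table ends with the same summary row.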