python pandas web-scraping href

Web scraping with Python. Get href from "a" elements


With the following code I can get all the data from a given range of pages (first to last) at the given URL:

import pandas as pd

F, L = 1, 2 # first and last pages

dico = {}
for page in range(F, L+1):
    url = f'https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page={page}&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    #sub_df.insert(0, "page_number", page)
    sub_df.insert(1, "Year", "AT")
    sub_df.insert(2, "Ind_Out", "I")
    sub_df.insert(3, "Gender", "M")
    sub_df.insert(4, "Event", "MILLA")
    sub_df.insert(5, "L_N", "L")
    dico[page] = sub_df
    
out = pd.concat(dico, ignore_index=True)
out.to_csv('WA_AT_I_M_MILLA_L.csv')

But I also need each athlete's code, which is part of the link in the "Competitor" field.

How can I insert a column with the href of each competitor?


Solution

  • I'm not sure why your code does everything it does, but to get the table on that page with an additional column holding the competitor code taken from each link, I would do this (the example covers only the first page, but you can extend it to the other pages):

    from io import StringIO

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup as bs
    
    url = "https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page=1&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17"
    req = requests.get(url)
    
    # this gets you the whole table, as is
    # (wrapped in StringIO because recent pandas versions deprecate passing raw HTML strings):
    sub_df = pd.read_html(StringIO(req.text))[0]
    # we need this to extract the codes:
    soup = bs(req.text, "html.parser")
    codes = [comp['href'].split('=')[1] for comp in soup.select('table.records-table td[data-th="Competitor"] a')]
    
    # we then insert the codes as a new column in the df (position 3 is right after Competitor)
    sub_df.insert(3, 'Code', codes)
    

    You should now have a new column right after Competitor. You can drop whatever column you don't want, add other columns and so on.
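
To extend this to several pages, as in the question, something along these lines might work. It is only a sketch: the helper names (`competitor_codes`, `scrape_page`, `scrape_pages`) and `BASE_URL` are made up for illustration, and it assumes the page still uses the `table.records-table` markup and the `...=<code>` href form relied on above. It reads the Competitor cells row by row, so a row without a link gets `None` instead of shifting the codes below it, and it wraps the HTML in `StringIO` because recent pandas versions deprecate passing a raw HTML string to `read_html`.

```python
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Same URL as in the question, with the page number left as a placeholder.
BASE_URL = (
    "https://www.worldathletics.org/records/all-time-toplists/middle-long/"
    "one-mile/indoor/men/senior?regionType=world&page={page}"
    "&bestResultsOnly=false&oversizedTrack=regular"
    "&firstDay=1899-12-31&lastDay=2023-02-17"
)


def competitor_codes(html):
    """Return one code (or None) per Competitor cell, in table order."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.select('table.records-table td[data-th="Competitor"]')
    codes = []
    for cell in cells:
        link = cell.find("a", href=True)
        # Assumption: the code is whatever follows the first "=" in the href.
        if link is not None and "=" in link["href"]:
            codes.append(link["href"].split("=")[1])
        else:
            codes.append(None)
    return codes


def scrape_page(page):
    """Hypothetical helper: one page of results plus a 'Code' column."""
    resp = requests.get(BASE_URL.format(page=page), timeout=30)
    resp.raise_for_status()
    df = pd.read_html(StringIO(resp.text))[0]
    codes = competitor_codes(resp.text)
    if len(codes) != len(df):
        # Fail loudly rather than attach codes to the wrong rows.
        raise ValueError(f"page {page}: {len(codes)} codes for {len(df)} rows")
    df.insert(3, "Code", codes)
    return df


def scrape_pages(first, last):
    """Concatenate pages first..last (inclusive) into a single DataFrame."""
    frames = [scrape_page(page) for page in range(first, last + 1)]
    return pd.concat(frames, ignore_index=True)


# Example usage (performs network requests):
# out = scrape_pages(1, 2)
# out.to_csv("WA_AT_I_M_MILLA_L.csv")
```

The constant columns from the question (`Year`, `Ind_Out` and so on) can be inserted into the combined frame afterwards in the same way as before.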