With the following code I can get all the data from the specified range of pages at the given URL:
import pandas as pd

F, L = 1, 2  # first and last pages
dico = {}
for page in range(F, L+1):
    url = f'https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page={page}&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    #sub_df.insert(0, "page_number", page)
    sub_df.insert(1, "Year", "AT")
    sub_df.insert(2, "Ind_Out", "I")
    sub_df.insert(3, "Gender", "M")
    sub_df.insert(4, "Event", "MILLA")
    sub_df.insert(5, "L_N", "L")
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
out.to_csv('WA_AT_I_M_MILLA_L.csv')
But I also need each athlete's code (from the link in the "Competitor" field).
How could I insert a column with the href of each competitor?
I'm not sure why you need everything your code does, but to get the table on that page with an additional column for the competitor code taken from the link, I would do this (here just for the first page, but you can obviously extend it):
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
url = "https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page=1&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17"
req = requests.get(url)
# this gets you the whole table, as is:
sub_df = pd.read_html(req.text)[0]
# we need this to extract the codes:
soup = bs(req.text,"html.parser")
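# select every competitor link in the records table and keep the part of its href after '='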
codes = [comp['href'].split('=')[1] for comp in soup.select('table.records-table td[data-th="Competitor"] a')]
# we then insert the codes as a new column in the DataFrame
sub_df.insert(3, 'Code', codes)
You should now have a new column right after Competitor. You can drop whatever columns you don't want, add other columns, and so on.
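To extend this to all your pages, you can fold the extraction into the loop from your question. Here is a minimal sketch reusing your URL and page range (untested at scale; if the site rejects plain requests you may need to send headers such as a User-Agent):

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

F, L = 1, 2  # first and last pages
base_url = ('https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior'
            '?regionType=world&page={}&bestResultsOnly=false&oversizedTrack=regular'
            '&firstDay=1899-12-31&lastDay=2023-02-17')

dico = {}
for page in range(F, L+1):
    req = requests.get(base_url.format(page))
    # table and codes are parsed from the same response, so the rows stay aligned
    sub_df = pd.read_html(req.text)[0]
    soup = bs(req.text, "html.parser")
    codes = [comp['href'].split('=')[1] for comp in soup.select('table.records-table td[data-th="Competitor"] a')]
    sub_df.insert(3, 'Code', codes)
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
out.to_csv('WA_AT_I_M_MILLA_L.csv')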