With a list of URLs, how do I retrieve the data from an HTML table at each individual URL?

Hello, I am quite new to Python and web scraping. I have obtained a list of URLs and would like to retrieve data from a table within each individual link, but I am running into some problems.

Here's what I've tried so far:

#import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

#start of code
mainurl = ""
def getAndParseURL(mainurl):
   result = requests.get(mainurl)
   soup = BeautifulSoup(result.content, 'html.parser')
   datatable = soup.find_all('a', href = True)
   return datatable

datatable = getAndParseURL(mainurl)

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)

#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])


#create empty array
accidentdata = []
#Loop through the URLs retrieved previously
for x in df['url']:
    html = requests.get(x).text
    soup = BeautifulSoup(html, "html.parser")
#identify table we want to scrape
    accidentdata_table = soup.find('table', {"class" : "list"})

#try clause to skip any other tables
#loop through table, grab each of the 9 columns in the accident data
    for row in accidentdata_table.find_all('tr'):
        cols = row.find_all('td')
        if len(cols) == 9:
            accidentdata.append((x, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(), cols[3].text.strip(), cols[4].text.strip(), cols[5].text.strip(), cols[6].text.strip, cols[7].text.strip(), cols[8].text.strip()))
except: pass

#convert output to new array, check length
accidentdata_array = np.asarray(accidentdata)

#convert new array to dataframe
df = pd.DataFrame(accidentdata_array)

The output of len(accidentdata_array) is 0. The code seems to be able to scrape, but I'm not getting the desired results.

I hope to get data from the following columns: date; type; registration; operator; fatalities; location; category.

Is there something wrong with the code? Any help is much appreciated, thank you!


  • I made a few modifications, but the main issue was that you need to add a User-Agent header to your requests.

    1. Added a headers parameter with a User-Agent.
    2. Used pandas to pull the table: pd.read_html() parses the <table> tags for you, using lxml or bs4 under the hood.
    3. It doesn't find a table in every link, so I added a list to hold the URLs where no table was found, so you can investigate those.
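
To see what the header changes, here is a minimal sketch that builds two requests without sending them and compares the User-Agent each would carry. By default requests identifies itself as python-requests/<version>, and the pages in the question apparently were not served to that client. The URL and the shortened User-Agent string below are placeholders for illustration, not values from the question.

```python
import requests

# Placeholder URL for illustration only; nothing is actually requested.
url = "https://example.com/"

# Shortened browser-like User-Agent, used here only as an example value.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# What requests sends when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()

# Prepare (but do not send) both requests so the final headers can be inspected.
session = requests.Session()
plain = session.prepare_request(requests.Request("GET", url))
custom = session.prepare_request(requests.Request("GET", url, headers=headers))

print(plain.headers["User-Agent"])   # the python-requests default
print(custom.headers["User-Agent"])  # the browser-like value from headers
```

Passing headers=headers to requests.get(), as in the loop in this answer, sends the second variant.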


    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    # browser-like User-Agent: this is the main fix
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

    #start of code
    mainurl = ""

    def getAndParseURL(mainurl):
        result = requests.get(mainurl, headers=headers)
        soup = BeautifulSoup(result.content, 'html.parser')
        datatable = soup.find_all('a', href=True)
        return datatable

    datatable = getAndParseURL(mainurl)

    #go through the content and grab the URLs
    links = []
    for link in datatable:
        if 'Year' in link['href']:
            url = link['href']
            links.append(mainurl + url)

    #check if links are in dataframe
    df = pd.DataFrame(links, columns=['url'])

    #one dataframe per page, plus a list to store urls that didn't pull a table
    tables = []
    no_table = []

    #Loop through the URLs retrieved previously and collect each table
    for x in df['url']:
        try:
            html = requests.get(x, headers=headers).text   # <----- added headers
            # pandas reads the html and parses the table tags; this returns
            # a list of dataframes and we want the dataframe in position 0
            table = pd.read_html(StringIO(html))[0]
            tables.append(table)
            print('Processed: %s' % x)
        except Exception:
            # e.g. read_html raises ValueError when the page has no <table>
            print('No table found: %s' % x)
            no_table.append(x)

    # DataFrame.append was removed in pandas 2.0, so combine with pd.concat
    results_df = pd.concat(tables, sort=True, ignore_index=True)
    results_df = results_df[['date', 'type', 'registration', 'operator', 'fat.', 'location', 'cat']]


    print (no_table)
    print (results_df)
                 date                               type  ...              location cat
    0       date unk.                     Antonov An-12B  ...                   NaN  U1
    1       date unk.                     Antonov An-12B  ...                   NaN  U1
    2       date unk.                     Antonov An-12B  ...                   NaN  U1
    3       date unk.                    Antonov An-12BK  ...       Tiksi Airpor...  A1
    4       date unk.                    Antonov An-12BP  ...       Massawa Airp...  A1
    5       date unk.                    Antonov An-12BP  ...                   NaN  U1
    6       date unk.                       Antonov An-2  ...               unknown  A1
    7       date unk.                       Antonov An-2  ...          Chita region  A2
    8       date unk.                     Antonov An-24B  ...                   NaN  A1
    9       date unk.                      Antonov An-26  ...       Belgorod Air...  A1
    10      date unk.                      Antonov An-26  ...       Wadi Bu al H...  A1
    11      date unk.                      Antonov An-26  ...                   NaN  A1
    12      date unk.                      Antonov An-26  ...       Orenburg Air...  O1
    13      date unk.                      Antonov An-2R  ...                   NaN  U1
    14      date unk.                      Antonov An-2R  ...                Mielec  O1
    15      date unk.                      Antonov An-32  ...       Kalaikunda A...  A1
    16      date unk.                     Antonov An-32A  ...                   NaN  A1
    17      date unk.                            Avia 14  ...       Sofia-Vrazhd...  O1
    18      date unk.                     BN-2A Islander  ...                   NaN  U1
    19      date unk.                     BN-2A Islander  ...                   NaN  U1
    20      date unk.                     BN-2A Islander  ...       Nassau Inter...  A1
    21      date unk.                     BN-2A Islander  ...                   NaN  U1
    22      date unk.                  BN-2A-20 Islander  ...       Charles Prin...  U1
    23      date unk.                  BN-2A-21 Islander  ...                   NaN  U1
    24      date unk.                  BN-2A-21 Islander  ...                   NaN  U1
    25      date unk.                  BN-2A-21 Islander  ...                   NaN  U1
    26      date unk.                  BN-2A-21 Islander  ...                   NaN  U1
    27      date unk.                  BN-2A-26 Islander  ...       Paphos Inter...  U1
    28      date unk.                   BN-2A-8 Islander  ...              Toluca ?  U1
    29      date unk.                   BN-2A-8 Islander  ...                   NaN  U1
              ...                                ...  ...                   ...  ..
    8468  19-JUN-2019                 Antonov An-124-100  ...       Tripoli Inte...  C1
    8469  20-JUN-2019                       Antonov An-2  ...  near Rodina villa...  A1
    8470  21-JUN-2019            Basler Turbo 67 (DC-3T)  ...  near Fort Hope Ai...  A2
    8471  23-JUN-2019                       Antonov An-2  ...  near Mlyny, Polta...  A1
    8472  24-JUN-2019         Hawker Siddeley HS-125-400  ...       Parque Nacio...  O1
    8473  27-JUN-2019                    Antonov An-24RV  ...       Nizhneangars...  A1
    8474  27-JUN-2019              BAe 3212 Jetstream 31  ...       Canaima Airp...  A1
    8475  28-JUN-2019                          Saab 340A  ...       Nassau-Lynde...  O2
    8476  29-JUN-2019          Cessna 208B Grand Caravan  ...       Plant City-B...  A2
    8477  30-JUN-2019           Beech B300 King Air 350i  ...       Dallas-Addis...  A1
    8478  01-JUL-2019                     Boeing 737-85R  ...       Mumbai-Chhat...  A2
    8479  08-JUL-2019                    Airbus A320-214  ...       Tripoli-Miti...  C2
    8480  08-JUL-2019          Cessna 208B Grand Caravan  ...       Bethel Airpo...  A1
    8481  08-JUL-2019                    Canadair CL-415  ...  near Roberval Air...  A2
    8482  09-JUL-2019               Airbus A320-214 (WL)  ...       Amsterdam-Sc...  A2
    8483  09-JUL-2019                Boeing 737-8K2 (WL)  ...       Amsterdam-Sc...  A2
    8484  09-JUL-2019                       Antonov An-2  ...  near Raduga, Novo...  A1
    8485  13-JUL-2019          Beech B200 Super King Air  ...       Graham Creek...  C1
    8486  16-JUL-2019                       Antonov An-2  ...       Novoshchedri...  A1
    8487  17-JUL-2019             Cessna 550 Citation II  ...       Mesquite Mun...  A1
    8488  19-JUL-2019                  DHC-8-402Q Dash 8  ...       Edmonton Int...  A2
    8489  20-JUL-2019                         ATR 42-500  ...       Gilgit Airpo...  A2
    8490  23-JUL-2019                Boeing 737-36N (WL)  ...       Lagos-Murtal...  A2
    8491  25-JUL-2019                   Ilyushin Il-76TD  ...       Al Jufra Air...  C1
    8492  25-JUL-2019                   Ilyushin Il-76TD  ...       Al Jufra Air...  C1
    8493  26-JUL-2019             Cessna 208 Caravan 675  ...       Addenbroke I...  A1
    8494  27-JUL-2019      Swearingen SA227-AC Metro III  ...       El Paso Inte...  A2
    8495  30-JUL-2019           Beech B300 King Air 350i  ...       Mora Kalu, R...  A1
    8496  30-JUL-2019                     Antonov An-72P  ...    near Grand Batanga  A1
    8497  01-AUG-2019  Douglas C-118A Liftmaster (DC-6A)  ...       Candle 2 Air...  A2
    [8498 rows x 7 columns]
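
If you want to try the pd.read_html() step on its own, here is a small offline sketch. It assumes lxml is installed (hence flavor="lxml"), and the two HTML snippets and their rows are made up for illustration; they are not data from the site. It shows that read_html returns a list of dataframes, and that a page without a <table> raises ValueError, which is the case the no_table list is meant to catch.

```python
from io import StringIO

import pandas as pd

# Made-up rows for illustration only; not data from the site.
html_with_table = """
<table>
  <tr><th>date</th><th>type</th><th>cat</th></tr>
  <tr><td>01-JAN-2000</td><td>Example 100</td><td>A1</td></tr>
  <tr><td>02-JAN-2000</td><td>Example 200</td><td>A2</td></tr>
</table>
"""
html_without_table = "<p>No table here</p>"

# read_html returns a list with one dataframe per <table> it finds.
tables = pd.read_html(StringIO(html_with_table), flavor="lxml")
first = tables[0]
print(len(tables))
print(first.columns.tolist())

# A page with no <table> raises ValueError instead of returning an empty list.
try:
    pd.read_html(StringIO(html_without_table), flavor="lxml")
    found = True
except ValueError:
    found = False
print(found)
```

Wrapping the HTML string in StringIO avoids the deprecation warning that recent pandas versions emit when read_html is given literal HTML.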