Tags: python, datatable, data-extraction

Online Table Extraction: 'NoneType' object has no attribute 'find_all'


The code runs up until the find_all instances are called. I am hoping to extract the data from the online table for some data science work. Any help would be greatly appreciated.

# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Create an URL object
url = 'https://www.flightradar24.com/data/aircraft/ja11jc'
# Create object page
page = requests.get(url)

# parser-lxml = Change html to Python friendly format
# Obtain page's information
soup = BeautifulSoup(page.text, 'lxml')
soup

# Obtain information from tag <table>
table1 = soup.find("table", id='tbl-datatable')
table1

# Obtain every title of columns with tag <th>
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)

# Create a dataframe
mydata = pd.DataFrame(columns = headers)

# Create a for loop to fill mydata
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row


So far I have checked for syntax errors and tried other methods of extracting the information into an Excel file, but I have had no luck.


Solution

  • The code runs up until the find_all instances are called

    Nope, it does not…

    The error is raised at find_all, but that does not mean the code ran fine up to that point.
    In my case:
    In my case:

    1. soup was effectively empty: requests returned response code 451, which is an error code. Adding a User-Agent header solves this issue.

    2. table1 was not found when parsing with lxml; one has to use html5lib on this particular page.
      Why? There can be many reasons for parsers behaving differently. As a rule of thumb, if lxml and html.parser don't give you the desired result, try html5lib: it is slower, but more fail-safe and standards-compliant.
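    The parser difference is easy to see on a minimal fragment. html5lib repairs markup the way a browser would, inserting the implicit elements that stricter parsers omit. A small sketch (assuming bs4 and html5lib are installed; the fragment is made up for illustration):

    ```python
    from bs4 import BeautifulSoup

    # A bare table fragment with no <tbody>, as often produced by scraped HTML
    fragment = "<table><tr><td>JL3798</td></tr></table>"

    # html5lib follows the browser's tree-construction rules, so it adds
    # the implicit <html>, <body> and <tbody> elements around the fragment
    lenient = BeautifulSoup(fragment, "html5lib")
    print(lenient.find("tbody") is not None)  # prints True
    ```

    With 'html.parser' the same fragment keeps its original shape, so code that assumes a <tbody> wrapper behaves differently depending on the parser.
    
    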

    Now we are done with the initial error, yet there are a few more things to consider:

    1. Your second loop will raise an error because it also picks up the <tfoot> row, whose structure is completely different from the <tbody> rows and therefore from the resulting DataFrame.
      Either iterate over the <tbody> rows only, or avoid the ValueError by adding an if clause that checks for a row/column mismatch:
      if len(row) != len(mydata.columns): break

    2. Not an error, but this table has some empty column names and cells that are rubbish in the resulting DataFrame. Filter the columns out in advance by checking the length of each column name; as for rows, skip empty cells and ones containing an unprintable Unicode character (the non-breaking space, '\xa0').
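    The width check from point 1 can be sketched in isolation like this (the header names and rows are hypothetical stand-ins, assuming pandas is installed):

    ```python
    import pandas as pd

    headers = ["DATE", "FLIGHT", "STATUS"]       # hypothetical column names
    df = pd.DataFrame(columns=headers)

    rows = [
        ["11 Dec 2022", "JL3798", "Scheduled"],  # a normal <tbody> row
        ["footer text spanning the table"],      # a <tfoot>-style row, wrong width
    ]
    for row in rows:
        if len(row) != len(df.columns):
            break                                # stop before the mismatched row
        df.loc[len(df)] = row

    print(len(df))  # prints 1: only the well-formed row was kept
    ```

    Using break assumes the footer comes last; if odd rows can appear anywhere, use continue instead to skip them individually.
    
    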

    That's it.
    Oh, and by the way: use either 4 spaces or a tab for indentation; a single space looks awful.
    Here is a working solution and output:

    import requests
    from bs4 import BeautifulSoup as bs
    import pandas as pd
    import re
    
    url = r'https://www.flightradar24.com/data/aircraft/ja11jc'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0)'}
    r = requests.get(url, headers=headers)
    soup = bs(r.content,'html5lib')
    table = soup.find("table", id='tbl-datatable')
    
    skip_column = 1
    
    thead = table.find('thead')
    headers = []
    for i in thead.find_all('th')[skip_column:]:
        if len(i.text) > 1:
            headers.append(i.text)
    df = pd.DataFrame(columns=headers)
    
    tbody = table.find('tbody')
    for j in tbody.find_all('tr'):
        row = j.find_all('td')
        row = [re.sub(r'\s+',' ',i.text.strip(' ')) for i in row if i.text != '' and '\xa0' not in i.text]
        df.loc[len(df)] = row[skip_column:]
    
    df
    
    Output:
        DATE        FROM                TO                  FLIGHT  FLIGHT TIME STD     ATD     STA     STATUS
    0   11 Dec 2022 Tokunoshima (TKN)   Kagoshima (KOJ)     JL3798  —           10:00   —       11:10   Scheduled
    1   11 Dec 2022 Amami (ASJ)         Tokunoshima (TKN)   JL3843  —           08:55   —       09:30   Scheduled
    2   11 Dec 2022 Tokunoshima (TKN)   Amami (ASJ)         JL3844  —           07:45   —       08:15   Scheduled
    ... ...         ...                 ...                 ...     ...         ...     ...     ...     ...
    65  04 Dec 2022 Kagoshima (KOJ)     Amami (ASJ)         JL3465  1:05        05:25   05:26   06:40   Landed 06:31
    66  04 Dec 2022 Amami (ASJ)         Kagoshima (KOJ)     JL3724  0:53        01:45   01:50   02:50   Landed 02:43
    67  04 Dec 2022 Kagoshima (KOJ)     Amami (ASJ)         JL3725  0:56        00:00   00:05   01:15   Landed 01:02
    

    other methods of extracting the information into an Excel file, but I have had no luck so far

    To save the resulting df to an Excel file, run:

    df.to_excel("ja11jc.xlsx")
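    Two keyword arguments are often worth adding; a small sketch with a stand-in DataFrame (in practice use the df built above, and note that to_excel needs an Excel engine such as openpyxl installed, e.g. via pip install openpyxl):

    ```python
    import pandas as pd

    # Stand-in DataFrame for illustration; replace with the scraped df
    df = pd.DataFrame({"FLIGHT": ["JL3798"], "STATUS": ["Scheduled"]})

    # index=False drops the integer row index from the sheet;
    # sheet_name labels the worksheet tab
    df.to_excel("ja11jc.xlsx", index=False, sheet_name="flights")
    ```
    
    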