python html web-scraping beautifulsoup nonetype

NoneType error when accessing the .text attribute of an existing <a> element


I am using BeautifulSoup to scrape the first wikitable on the Wikipedia page "List of military engagements during the Russian invasion of Ukraine" to get the names of all 57 battles. I have attached an image of the table's HTML for reference.

To grab all the <a> elements in the first column and get just the text (the battle names), I did the following:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_military_engagements_during_the_Russian_invasion_of_Ukraine'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('table')
rows = table.find_all('tr')

battlenames = []
for row in rows:
    # Find the first <td> element within the row
    td_element = row.find('td')
    if td_element:
        # Find the first <a> element within the <td> element
        battlename = td_element.find('a')
        cleanname = battlename.text
        battlenames.append(cleanname)

for name in battlenames:
    print(name)

I ran this in both Spyder and Jupyter Notebook and got the following error:

AttributeError                            Traceback (most recent call last)
Cell In[6], line 18
     15     if td_element:
     16         # Find the first <a> element within the <td> element
     17         battlename = td_element.find('a')
---> 18         cleanname = battlename.text
     19         battlenames.append(cleanname)
     21 for name in battlenames:

AttributeError: 'NoneType' object has no attribute 'text'

This surprised me because the first <td> element of every row (<tr>) contains an <a> element with the battle name. That is, there are no empty cells in the table's first column that would cause a NoneType error. What could be the issue?


Solution

  • EDIT

    Based on a comment from @Ouroboros1, to be more precise: the issue is that some of the selected <td> elements do not contain an <a>.

    The table contains one "sub" <tr> for "Battles of Voznesensk" whose first <td> holds "9 March 2022" from the "Start date" column. This <td> happens to contain no <a> link, so td_element.find('a') returns None for that row.

    So you also have to check that an <a> was found before calling .text:

    if td_element:
        # Find the first <a> element within the <td> element
        battlename = td_element.find('a')
        # Also check here that an <a> was actually found
        if battlename:
            cleanname = battlename.text
            battlenames.append(cleanname)
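
To see why the guard is needed, the situation can be reproduced offline with a small hand-written table. This is only a sketch: the markup, the rowspan attribute, the href values, the first start date and the "Battle of Kyiv" row are assumptions made for illustration and are not copied from the Wikipedia page. It also uses html.parser instead of lxml so that nothing beyond bs4 has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (not the real Wikipedia HTML).
# Assumption: the name cell spans two rows, so the second <tr> starts
# with a date cell that contains no <a>.
html = """
<table>
  <tr><th>Name</th><th>Start date</th></tr>
  <tr><td rowspan="2"><a href="/wiki/Battles_of_Voznesensk">Battles of Voznesensk</a></td><td>2 March 2022</td></tr>
  <tr><td>9 March 2022</td></tr>
  <tr><td><a href="/wiki/Battle_of_Kyiv">Battle of Kyiv</a></td><td>25 February 2022</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

battlenames = []
skipped = []  # text of first cells that had no link, kept for inspection
for row in soup.table.find_all("tr"):
    td_element = row.find("td")
    if td_element is None:
        continue  # header row: it only has <th> cells
    link = td_element.find("a")
    if link is None:
        # Without this check, link.text raises the AttributeError
        skipped.append(td_element.get_text(strip=True))
        continue
    battlenames.append(link.get_text(strip=True))

print(battlenames)  # ['Battles of Voznesensk', 'Battle of Kyiv']
print(skipped)      # ['9 March 2022']
```

Collecting the skipped cells is optional, but printing them is a quick way to find out which rows break the "every first <td> has a link" assumption.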
    

    You could also change your selection strategy and use CSS selectors to select only the <tr> elements whose first <td> contains an <a>:

    soup.table.select('tr:has(td:first-of-type a)')
    

    or even select all <a> elements in the first <td> of each <tr> directly:

    soup.table.select('tr td:first-of-type a')
    

    Example with CSS selectors

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://en.wikipedia.org/wiki/List_of_military_engagements_during_the_Russian_invasion_of_Ukraine'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'lxml')
    
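    # Option A: select the rows whose first <td> contains an <a>
    for row in soup.table.select('tr:has(td:first-of-type a)'):
        print(row.td.a.text)
    
    # Option B: select the <a> elements in the first <td> of each row directly
    for a in soup.table.select('tr td:first-of-type a'):
        print(a.text)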
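
One thing to keep in mind is that Option A and Option B are not always interchangeable: Option A prints one name per matching row, while Option B returns every <a> inside the first cell. The sketch below illustrates this on hand-written markup. The footnote link, the battle names and the href values are hypothetical, and whether the real table has such extra links in its first column is an assumption you would need to check.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first name cell carries an extra footnote link.
html = """
<table>
  <tr><th>Name</th><th>Start date</th></tr>
  <tr><td><a href="/wiki/A">Battle A</a><sup><a href="#cite_note-1">[1]</a></sup></td><td>1 March 2022</td></tr>
  <tr><td><a href="/wiki/B">Battle B</a></td><td>2 March 2022</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Option A: one entry per row, taken from the first <a> of the first <td>
option_a = [row.td.a.text
            for row in soup.table.select('tr:has(td:first-of-type a)')]

# Option B: every <a> found inside the first <td> of a row
option_b = [a.text
            for a in soup.table.select('tr td:first-of-type a')]

print(option_a)  # ['Battle A', 'Battle B']
print(option_b)  # ['Battle A', '[1]', 'Battle B']
```

If only the battle names are wanted, Option A is probably the safer choice under these assumptions, since it ignores any additional links in the same cell.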