I am using BeautifulSoup to scrape the first wikitable on the page List of military engagements during the Russian invasion of Ukraine to get the names of all 57 battles. I have attached an image of the table's HTML for reference: HTML of the wikitable.
To grab all the <a> elements in the first column and get just the text (the battle names), I did the following:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_military_engagements_during_the_Russian_invasion_of_Ukraine'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('table')
rows = table.find_all('tr')
battlenames = []
for row in rows:
    # Find the first <td> element within the row
    td_element = row.find('td')
    if td_element:
        # Find the first <a> element within the <td> element
        battlename = td_element.find('a')
        cleanname = battlename.text
        battlenames.append(cleanname)
for name in battlenames:
    print(name)
I ran this in both Spyder and Jupyter Notebook and got the following error:
AttributeError Traceback (most recent call last)
Cell In[6], line 18
15 if td_element:
16 # Find the first <a> element within the <td> element
17 battlename = td_element.find('a')
---> 18 cleanname = battlename.text
19 battlenames.append(cleanname)
21 for name in battlenames:
AttributeError: 'NoneType' object has no attribute 'text'
This surprised me because the first <td> element of every row (<tr>) contains an <a> element with the battle name. I.e., there are no empty cells in the table's first column that would cause a NoneType error. What could be the issue?
As @Ouroboros1 pointed out in a comment, the issue is precisely that some td elements do not contain an a.
The table contains one "sub" tr for "Battles of Voznesensk", whose first td holds "9 March 2022" in the "Start date" column. For that row, row.find('td') returns the date cell, and this td happens to have no a link, so td_element.find('a') returns None. You therefore also have to check that an a exists before calling .text:
if td_element:
    # Find the first <a> element within the <td> element
    battlename = td_element.find('a')
    # Also check here that an <a> was actually found
    if battlename:
        cleanname = battlename.text
        battlenames.append(cleanname)
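To see the effect of that extra check without hitting Wikipedia, here is a minimal sketch that runs the same loop against a small inline table. The SAMPLE_HTML markup and the helper name first_column_link_texts are made up for illustration (they only imitate the "sub" tr situation described above), and it uses the built-in 'html.parser' so lxml is not needed:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the wikitable: the second data row is a
# "sub" row whose first <td> is a date cell without any link.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Start date</th></tr>
  <tr><td rowspan="2"><a href="/wiki/A">Battle A</a></td><td>2 March 2022</td></tr>
  <tr><td>9 March 2022</td></tr>
  <tr><td><a href="/wiki/B">Battle B</a></td><td>24 February 2022</td></tr>
</table>
"""


def first_column_link_texts(html):
    """Return the text of the first <a> in the first <td> of each row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    names = []
    for row in table.find_all("tr"):
        td = row.find("td")
        if td is None:
            # Header row: it only has <th> cells
            continue
        link = td.find("a")
        if link is None:
            # "Sub" row: the first <td> is a date cell with no link
            continue
        names.append(link.get_text(strip=True))
    return names


names = first_column_link_texts(SAMPLE_HTML)
print(names)
```

Without the `link is None` check, the third row of this sample would raise the same AttributeError as in the question.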
You could also change your selection strategy and use CSS selectors to select only the tr elements whose first td contains an a:
soup.table.select('tr:has(td:first-of-type a)')
or even select all a elements in the first td of each tr directly:
soup.table.select('tr td:first-of-type a')
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_military_engagements_during_the_Russian_invasion_of_Ukraine'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

# Option A
for row in soup.table.select('tr:has(td:first-of-type a)'):
    print(row.td.a.text)

# Option B
for a in soup.table.select('tr td:first-of-type a'):
    print(a.text)
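If you want to try the two selectors offline first, the following sketch applies them to a small inline table. The SAMPLE_HTML markup is invented for illustration and only mimics the layout described above (a header row, a "sub" row that starts with a date cell); the real page may behave differently, for example if a first cell contains more than one link:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the wikitable, not the real page markup
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Start date</th></tr>
  <tr><td rowspan="2"><a href="/wiki/A">Battle A</a></td><td>2 March 2022</td></tr>
  <tr><td>9 March 2022</td></tr>
  <tr><td><a href="/wiki/B">Battle B</a></td><td>24 February 2022</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Option A: keep only rows whose first <td> contains a link,
# then read the link text from that row
option_a = [
    row.td.a.text
    for row in soup.table.select("tr:has(td:first-of-type a)")
]

# Option B: select the links in the first <td> of each row directly
option_b = [a.text for a in soup.table.select("tr td:first-of-type a")]

print(option_a)
print(option_b)
```

On this sample both options skip the header row and the date-only "sub" row, so neither needs an explicit None check.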