I'm trying to scrape info from Wikipedia using the function below, but I'm running into an AttributeError because a function call is returning None. Can someone please explain why this call returns None?
import wikipedia as wp
import string

def add_section_info(search):
    HTML = wp.page(search).html().encode("UTF-8")  # gets HTML source from Wikipedia
    with open("temp.xml", 'w') as t:  # write HTML to xml format
        t.write(HTML)
    table_of_contents = []
    dict_of_section_info = {}
    # This extracts the info in the table of contents
    with open("temp.xml", 'r') as r:
        for line in r:
            if "toclevel" in line:
                new_string = line.partition("#")[2]
                content_title = new_string.partition("\"")[0]
                tbl = string.maketrans("_", " ")
                content_title = content_title.translate(tbl)
                table_of_contents.append(content_title)
    print wp.page(search).section("Aortic rupture")  # this is None, but shouldn't be
    for item in table_of_contents:
        section = wp.page(search).section(item).encode("UTF-8")
        print section
        if section == "":
            continue
        else:
            dict_of_section_info[item] = section
    with open("Section_Info.txt", 'a') as sect:
        sect.write(search)
        sect.write("------------------------------------------\n")
        for item in dict_of_section_info:
            sect.write(item)
            sect.write("\n\n")
            sect.write(dict_of_section_info[item])
            sect.write("####################################\n\n")

add_section_info("Abdominal aortic aneurysm")
What I don't understand is that if I run add_section_info("HIV"), for example, it works perfectly.
The source code for the imported wikipedia is here
My output on the above code is this:
Abdominal aortic aneurysm
Signs and symptoms
Traceback (most recent call last):
  File "/home/pharoslabsllc/Documents/wikitest.py", line 79, in <module>
    add_section_info(line)
  File "/home/pharoslabsllc/Documents/wikitest.py", line 30, in add_section_info
    section = wp.page(search).section(item).encode("UTF-8")
AttributeError: 'NoneType' object has no attribute 'encode'
The page method never returns None (you can easily check this in the source code); however, the section method does return None if the title cannot be found. See the documentation:
section(section_title)
Get the plain text content of a section from self.sections. Returns None if section_title isn’t found, otherwise returns a whitespace stripped string.
So the answer is that the Wikipedia page you are referring to has no section titled Aortic rupture, as far as the library is concerned.
Looking at Wikipedia itself, the page Abdominal aortic aneurysm does seem to have such a section.
Note that if you check the value of wp.page(search).sections you get []. That is, the library doesn't seem to be parsing the sections properly.
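Because section can return None, it is the chained .encode("UTF-8") call in the question that raises the AttributeError. A minimal sketch of a guard, assuming a hypothetical helper named encode_or_none (it is not part of the wikipedia library):

```python
def encode_or_none(text, encoding="utf-8"):
    """Encode section text, passing a missing section (None) through unchanged."""
    if text is None:
        # page.section(title) returned None: the title was not found.
        return None
    return text.encode(encoding)


# Intended use (needs network access, so it is left commented out):
# section = encode_or_none(wp.page(search).section(item))
# if section is None, skip the item instead of crashing.
```

This only avoids the crash; it does not explain why the section is missing in the first place.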
From the source code of the library found here you can see this test:
section = u"== {} ==".format(section_title)
try:
    index = self.content.index(section) + len(section)
except ValueError:
    return None
However:
In [14]: p.content.find('Aortic')
Out[14]: 3223
In [15]: p.content[3220:3220+50]
Out[15]: '== Aortic ruptureEdit ===\n\nThe signs and symptoms '
In [16]: p.section('Aortic ruptureEdit')
Out[16]: "The signs and symptoms of a ruptured AAA may includes severe pain in the lower back, flank, abdomen or groin. A mass that pulses with the heart beat may also be felt. The bleeding can leads to a hypovolemic shock with low blood pressure and a fast heart rate. This may lead to brief passing out.\nThe mortality of AAA rupture is up to 90%. 65–75% of patients die before they arrive at hospital and up to 90% die before they reach the operating room. The bleeding can be retroperitoneal or into the abdominal cavity. Rupture can also create a connection between the aorta and intestine or inferior vena cava. Flank ecchymosis (appearance of a bruise) is a sign of retroperitoneal bleeding, and is also called Grey Turner's sign.\nAortic aneurysm rupture may be mistaken for the pain of kidney stones, muscle related back pain."
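Note the Edit == at the end of the heading: the text of the edit link has been fused onto the title, so the search for == Aortic rupture == fails. In other words, the library has a bug: it doesn't account for the edit link.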
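The failed lookup can be reproduced offline. The sketch below copies only the library lines quoted above into a hypothetical standalone function, section_start, and runs it against the exact string shown in Out[15]. It is an illustration of the matching step, not the library's full implementation.

```python
def section_start(content, section_title):
    """Return the index just past the heading marker, or None if it is absent."""
    section = u"== {} ==".format(section_title)
    try:
        return content.index(section) + len(section)
    except ValueError:
        return None


# The slice of p.content shown in Out[15].
sample = u"== Aortic ruptureEdit ===\n\nThe signs and symptoms "

# The plain title is not found, because the heading text is "Aortic ruptureEdit".
missing = section_start(sample, u"Aortic rupture")
# Including the stray "Edit" makes the marker match.
found = section_start(sample, u"Aortic ruptureEdit")

print(missing, found)
```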
The same code works with the page for HIV because on that page the headings don't have an edit link right next to them. I have no idea why this is so; in any case, it looks like either a bug or a shortcoming of the library, so you should open a ticket on its issue tracker.
In the meantime you could use a simple fix like:
def find_section(page, title):
    res = page.section(title)
    if res is None:
        res = page.section(title + 'Edit')
    return res
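and use this function instead of calling the .section method directly, for example find_section(wp.page(search), item). However, this can only be a temporary fix, and it can still return None if neither form of the title is found.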
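The same fallback can be applied to the whole loop from the question. The following is a sketch under stated assumptions: collect_sections is a hypothetical helper, it takes any callable that behaves like page.section (returns the text, or None when the title is not found), and the dictionary with placeholder strings stands in for a real page so that the example runs without network access.

```python
def collect_sections(get_section, titles, suffix="Edit"):
    """Map each title to its section text, skipping titles that cannot be found.

    get_section is any callable returning the text or None, for example the
    bound method page.section of a wikipedia page object.
    """
    found = {}
    for title in titles:
        text = get_section(title)
        if text is None:
            # Fallback for headings that carry the stray "Edit" suffix.
            text = get_section(title + suffix)
        if not text:
            # Still missing (None) or empty: skip instead of raising.
            continue
        found[title] = text
    return found


# Offline stand-in for page.section: dict.get also returns None for unknown keys.
# The values are placeholders, not real Wikipedia content.
fake_sections = {
    "Signs and symptoms": "first section text",
    "Aortic ruptureEdit": "second section text",
}

result = collect_sections(
    fake_sections.get,
    ["Signs and symptoms", "Aortic rupture", "Missing"],
)
print(sorted(result))
```

With a real page, the call would be collect_sections(page.section, table_of_contents), where page = wp.page(search) is fetched once instead of on every iteration.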