python wikipedia attributeerror nonetype

Why is this returning a NoneType?


I'm trying to scrape information from Wikipedia using the function below, but I get an AttributeError because a method call returns None. Can someone explain why it returns None?

import wikipedia as wp
import string

def add_section_info(search):
    HTML = wp.page(search).html().encode("UTF-8") #gets HTML source from Wikipedia

    with open("temp.xml",'w') as t: #write HTML to xml format
        t.write(HTML)

    table_of_contents = []
    dict_of_section_info = {}

    #This extracts the info in the table of contents
    with open("temp.xml",'r') as r:
        for line in r:
            if "toclevel" in line: 
                new_string = line.partition("#")[2]
                content_title = new_string.partition("\"")[0]
                tbl = string.maketrans("_"," ")
                content_title = content_title.translate(tbl)
                table_of_contents.append(content_title)

    print wp.page(search).section("Aortic rupture") #this is None, but shouldn't be

    for item in table_of_contents:
        section = wp.page(search).section(item).encode("UTF-8")
        print section
        if section == "":
            continue
        else:
            dict_of_section_info[item] = section

    with open("Section_Info.txt",'a') as sect:
        sect.write(search)
        sect.write("------------------------------------------\n")
        for item in dict_of_section_info:
            sect.write(item)
            sect.write("\n\n")
            sect.write(dict_of_section_info[item])
        sect.write("####################################\n\n")

add_section_info("Abdominal aortic aneurysm")

What I don't understand is that if I run add_section_info("HIV"), for example, it works perfectly.

The source code for the imported wikipedia is here

My output on the above code is this:

Abdominal aortic aneurysm

Signs and symptoms
Traceback (most recent call last):
  File "/home/pharoslabsllc/Documents/wikitest.py", line 79, in <module>
    add_section_info(line)
  File "/home/pharoslabsllc/Documents/wikitest.py", line 30, in add_section_info
    section = wp.page(search).section(item).encode("UTF-8")
AttributeError: 'NoneType' object has no attribute 'encode'

Solution

  • The page method never returns None (you can easily check this in the source code). The section method, however, does return None if the title cannot be found. See the documentation:

    section(section_title)

    Get the plain text content of a section from self.sections. Returns None if section_title isn’t found, otherwise returns a whitespace stripped string.

    So the answer is that, as far as the library is concerned, the Wikipedia page you are referring to has no section titled Aortic rupture.

    Looking at Wikipedia itself, the page Abdominal aortic aneurysm does seem to have such a section.

    Note that wp.page(search).sections evaluates to [], which suggests that the library isn't parsing the sections properly.


    In the library's source code, found here, you can see this lookup:

    section = u"== {} ==".format(section_title)
    try:
      index = self.content.index(section) + len(section)
    except ValueError:
      return None
    

    However:

    In [14]: p.content.find('Aortic')
    Out[14]: 3223
    
    In [15]: p.content[3220:3220+50]
    Out[15]: '== Aortic ruptureEdit ===\n\nThe signs and symptoms '
    In [16]: p.section('Aortic ruptureEdit')
    Out[16]: "The signs and symptoms of a ruptured AAA may includes severe pain in the lower back, flank, abdomen or groin. A mass that pulses with the heart beat may also be felt. The bleeding can leads to a hypovolemic shock with low blood pressure and a fast heart rate. This may lead to brief passing out.\nThe mortality of AAA rupture is up to 90%. 65–75% of patients die before they arrive at hospital and up to 90% die before they reach the operating room. The bleeding can be retroperitoneal or into the abdominal cavity. Rupture can also create a connection between the aorta and intestine or inferior vena cava. Flank ecchymosis (appearance of a bruise) is a sign of retroperitoneal bleeding, and is also called Grey Turner's sign.\nAortic aneurysm rupture may be mistaken for the pain of kidney stones, muscle related back pain."
    

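    Note the Edit before the closing ==. The library searches for the exact string == Aortic rupture ==, which never appears in the content because the text of the heading's edit link has been appended to the title. In other words, the library has a bug: it does not account for the edit link.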
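To make the failure concrete, here is a minimal offline sketch that mimics the lookup quoted above on a plain string. The helper name section_start and the sample string are assumptions made for illustration; this is not the library's actual code.

```python
def section_start(content, section_title):
    """Mimic the quoted lookup: return the index just past the
    heading marker, or None when the exact marker is absent."""
    marker = u"== {} ==".format(section_title)
    try:
        return content.index(marker) + len(marker)
    except ValueError:
        return None


# Hypothetical excerpt shaped like the p.content output shown above.
sample = u"=== Aortic ruptureEdit ===\n\nThe signs and symptoms of a ruptured AAA"

print(section_start(sample, u"Aortic rupture"))      # None: the exact marker is absent
print(section_start(sample, u"Aortic ruptureEdit"))  # an integer index
```

The first call fails only because "Edit" sits between the title and the closing ==, so the exact marker string is never found.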

    The same code works with the page for HIV because the headings on that page don't have an edit link right next to them. I have no idea why that is. Either way, it looks like a bug or a shortcoming of the library, so you should open a ticket on its issue tracker.

    In the meantime you could use a simple workaround like:

    def find_section(page, title):
        res = page.section(title)
        if res is None:
            res = page.section(title + 'Edit')
        return res
    

    and call this function instead of the .section method. However, this can only be a temporary fix.
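As a rough sketch of how the fallback could slot into the loop from the question, the example below skips sections that come back as None or empty instead of calling .encode on them. FakePage and collect_sections are hypothetical names used only so the example runs without network access; with the real library you would fetch the page object once and pass it in.

```python
def find_section(page, title):
    """Try the plain title first, then the title with the stray 'Edit' suffix."""
    res = page.section(title)
    if res is None:
        res = page.section(title + 'Edit')
    return res


def collect_sections(page, titles):
    """Map each title to its text, skipping missing or empty sections."""
    found = {}
    for title in titles:
        text = find_section(page, title)
        if text:  # skips both None and ""
            found[title] = text
    return found


class FakePage(object):
    """Hypothetical stand-in for a page object, for offline testing only."""
    def __init__(self, sections):
        self._sections = sections

    def section(self, title):
        # Like the documented method, return None for unknown titles.
        return self._sections.get(title)


page = FakePage({"Aortic ruptureEdit": "The signs and symptoms ...", "Causes": ""})
print(collect_sections(page, ["Aortic rupture", "Causes", "Missing"]))
```

Checking the result before using it is what avoids the AttributeError from the traceback, whichever cause made the section lookup return None.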