
Wikipedia API: get the text under headers


I can scrape a Wikipedia page using the wikipedia API:

import wikipedia
import re
page = wikipedia.page("Albert Einstein")
text = page.content
regex_result = re.findall("==\s(.+?)\s==", text)
print(regex_result)

For every element in regex_result (the Wikipedia headers), I want to get the text below it and append that text to another list. I searched the internet, but I could not find a function in the Wikipedia API that does this. A second option would be to take the full text and use some module to extract the text between headers; more here: find some text in a string between some specific characters

I have tried this:

l = 0
for n in regex_result:
    try:
        regal = re.findall(f"==\s{regex_result[l]}\s==(.+?)\s=={regex_result[l+1]}\s==", text)
        l+=2
    except Exception:
        continue

But it is not working: the output is only []


Solution

  • You don't want to call re twice, but rather iterate directly through the results provided by regex_result. Named groups in the form of (?P<name>...) make it even easier to extract the header name without the surrounding markup.

    import wikipedia
    import re
    page = wikipedia.page("Albert Einstein")
    text = page.content
    # using the number 2 for '=' means you can easily find sub-headers too by increasing the value
    # (raw string, so \s is passed to the regex engine unchanged)
    regex_result = re.findall(r"\n={2}\s(?P<header>.+?)\s={2}\n", text)
    

    regex_result will then be a list of strings containing all the top-level section headers.
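As a small illustration of the named group mentioned above, here is a minimal sketch that iterates over the matches with re.finditer and reads the header name via group("header"). The sample string is a made-up stand-in for page.content so the snippet runs without network access, and it anchors on line boundaries with re.MULTILINE rather than on literal \n characters, which is a variation on the pattern above, not the same pattern.

```python
import re

# Hypothetical stand-in for page.content, so the example runs offline.
sample = (
    "Intro text.\n\n\n"
    "== Biography ==\nBorn in Ulm.\n\n\n"
    "=== Early life ===\nSchool years.\n\n\n"
    "== Legacy ==\nWidely cited.\n"
)

# ^ and $ match at line boundaries because of re.MULTILINE.
# Exactly two '=' followed by whitespace, so '=== Early life ===' is skipped.
header_pattern = re.compile(r"^={2}\s(?P<header>.+?)\s={2}$", re.MULTILINE)

# Each match object exposes the header text through the named group.
headers = [m.group("header") for m in header_pattern.finditer(sample)]
print(headers)  # ['Biography', 'Legacy']
```

Each match object also carries start() and end() positions, which is what makes it possible to slice out the text between two headers.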

    Here's what I use to make a table of contents from a wiki page. (Note: f-strings require Python 3.6 or later.)

    def get_wikiheader_regex(level):
        '''The top wikiheader level has two = signs, so add 1 to the level to get the correct number.'''
        assert isinstance(level, int) and level > -1
        # Raw f-string: \s stays a regex escape, and {{{level+1}}} renders as a literal {n} quantifier.
        header_regex = rf"^={{{level+1}}}\s(?P<section>.*?)\s={{{level+1}}}$"
    
        return header_regex
    
    def get_toc(raw_page, level=1):
        '''For a single raw wiki page, return the section headers at the given level (level 1 by default) as a table of contents.'''
        toc = []
        header_regex = get_wikiheader_regex(level=level)
        for line in raw_page.splitlines():
            # Run the regex once per candidate line and reuse the match object.
            match = re.search(header_regex, line) if line.startswith('=') else None
            if match:
                toc.append(match.group('section'))
    
        return toc

    >>> get_toc(text)
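To get the text under each header, which is what the question asks for, one possible approach is to record where each header match ends and where the next one starts, then slice the text between those positions. The sketch below is an assumption-laden illustration, not part of the wikipedia package: split_top_level_sections and the sample string are hypothetical names and data. It also sidesteps two likely reasons the attempt in the question returns []: by default `.` does not match a newline, and that pattern expects the next header directly after `==` with no space.

```python
import re

# Top-level headers only: exactly two '=' on each side, alone on a line.
HEADER_PATTERN = re.compile(r"^={2}\s(?P<section>.+?)\s={2}$", re.MULTILINE)

def split_top_level_sections(text):
    """Return a list of (header, body) tuples for the top-level sections.

    Hypothetical helper for illustration. The body runs from the end of one
    top-level header to the start of the next, so sub-headers and their text
    stay inside the body of their parent section.
    """
    matches = list(HEADER_PATTERN.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        body_start = match.end()
        # The last section runs to the end of the text.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((match.group("section"), text[body_start:body_end].strip()))
    return sections

# Made-up stand-in for page.content, so the example runs offline.
sample = (
    "Intro text.\n\n\n"
    "== Biography ==\nBorn in Ulm.\n\n\n"
    "=== Early life ===\nSchool years.\n\n\n"
    "== Legacy ==\nWidely cited.\n"
)

sections = split_top_level_sections(sample)
for header, body in sections:
    print(header, "->", repr(body))
```

A list of tuples is used instead of a dict so that pages with repeated header names keep every section. Text before the first header (the lead paragraph) is not included in the result.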