python-2.7 wikipedia-api pywikibot

How to get specific Wikipedia page section?


I want to create a graph database of actors and the movies in which they've acted. To get the list of actors and movies, I'm trying to use the pywikibot parser, but I've only been able to get the full page, when I just want the filmography portion of the page. Is there a way to parse the page so I can just obtain the filmography? Here's what I've done so far:

import pywikibot as pw

site = pw.Site()
page = pw.Page(site, actor_name)  # will be put into a loop to get multiple actors
print(page.text)  # prints the full wikitext of the page, in the format below
print(list(page.linkedPages()))  # linkedPages() is a method; it yields the pages this page links to

One idea I had was to return all the linked pages associated with the actor, since most movies are linked. The text data comes back in the following format:

{{Infobox person
| name         = 
| birth name   =
}}

Summary

==Early life==

==Career==

==Filmography==

What can I do to only get the Filmography portion of the page?


Solution

  • You can do it with the Wikipedia API. For example, to get the Filmography section for William Alland, first find the index of the section named "Filmography" with:

    https://en.wikipedia.org/w/api.php?action=parse&prop=sections&page=William Alland
    

    From the response we see that the index is 2. Then use that index to get the text of only that section:

    https://en.wikipedia.org/w/api.php?action=parse&prop=text&section=2&page=William Alland
    

    Note: Use prop=wikitext instead of prop=text to get the content in wiki markup.
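
The two requests above can be combined in a short script. This is only a sketch, written for Python 3 with the standard library rather than pywikibot. The helper names (build_url, find_section_index, fetch_json, get_section_wikitext) and the User-Agent string are made up for illustration, and it assumes the default JSON response layout, where each section entry has "line" and "index" keys and the wikitext is stored under parse, wikitext, "*".

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_url(**params):
    """Build an action=parse URL; urlencode takes care of spaces in page titles."""
    query = dict(action="parse", format="json", **params)
    return API_URL + "?" + urlencode(query)


def find_section_index(sections, title):
    """Return the 'index' of the first section whose heading matches title, else None."""
    for section in sections:
        if section.get("line", "").strip().lower() == title.lower():
            return section["index"]
    return None


def fetch_json(url):
    """Fetch a URL and decode the JSON body (hypothetical helper, no error handling)."""
    # The User-Agent value is a placeholder; replace it with one describing your script.
    request = Request(url, headers={"User-Agent": "filmography-sketch/0.1 (example)"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def get_section_wikitext(page_title, section_title="Filmography"):
    """Look up the section index first, then request only that section as wikitext."""
    listing = fetch_json(build_url(page=page_title, prop="sections"))
    index = find_section_index(listing["parse"]["sections"], section_title)
    if index is None:
        return None
    data = fetch_json(build_url(page=page_title, prop="wikitext", section=index))
    # Assumes the default response format, where the text sits under the "*" key.
    return data["parse"]["wikitext"]["*"]
```

Calling get_section_wikitext("William Alland") makes the same two requests as the URLs above and returns None when the page has no section with that heading. Looking the index up by name on every page matters here because the Filmography section will not sit at the same position for every actor.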