wikipedia dbpedia wikipedia-api wikidata

Extract story plots from Wikipedia


Goal

I want to extract story plots from the English Wikipedia. I'm only looking for a few (~100) and the source of the plots doesn't matter, e.g. novels, video games, etc.

I briefly tried a few things that didn't work, and I need some clarification on what I'm missing and where to direct my efforts. It would be nice if I could avoid manual parsing and just issue a single query.

Things I tried

1. markriedl/WikiPlots

This repo downloads the pages-articles dump, expands it using wikiextractor, then scans each article and saves the contents of each section whose title contains "plot". This is a heavy-handed method of achieving what I want, but I gave it a try and failed. I had to run wikiextractor inside Docker because there are known issues with Windows, and then wikiextractor failed because there is a problem with the --html flag.

I could probably get this working, but it would take a lot of effort and there seemed to be better ways.

2. Wikidata

I used the Wikidata SPARQL service and was able to get some queries working, but it seems like Wikidata only deals with metadata and relationships. Specifically, I was able to get novel titles but unable to get novel summaries.

3. DBpedia

In theory, DBpedia should be exactly what I want because it's "Wikipedia but structured", but they don't have nice tutorials and examples like Wikidata does, so I couldn't figure out how to use their SPARQL endpoint. Google wasn't much help either and seemed to imply that it's common to set up your own graph DB to query, which is beyond my scope.

4. Quarry

This is a new query service that lets you query several Wikimedia databases. It sounds promising, but again I was unable to grab article content.

5. PetScan & title download

This SO answer says I can query PetScan to get Wikipedia titles, download HTML from Wikipedia.org, then parse that HTML. This sounds like it would work, but PetScan looks intimidating and this involves HTML parsing that I want to avoid if possible.


Solution

  • There's no straightforward way to do this, because Wikipedia content isn't structured the way you would like it to be. I'd use PetScan to get a list of articles based on a category, then feed each title into e.g. https://en.wikipedia.org/w/api.php?action=parse&page=The%20Hobbit&format=json&prop=sections and iterate through the sections. If a section's 'line' attribute == 'Plot', call e.g. https://en.wikipedia.org/w/api.php?action=parse&page=The%20Hobbit&format=json&prop=text&section=2 where 'section' is the 'index' of the section titled Plot. That gives you HTML, and I can't figure out how to get just the plain text, but you might be able to make sense of https://www.mediawiki.org/w/api.php?action=help&modules=parse
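
The two API calls above can be sketched in Python roughly as follows. This is a minimal sketch under a few assumptions, not a tested pipeline: the helper names (api_get, find_plot_index, html_to_text, fetch_plot) are made up for illustration, the list of titles is assumed to come from PetScan or elsewhere, and a section counts as the plot if its heading merely contains "plot" (the same loose rule WikiPlots uses). It passes the section's 'index' field as the section parameter and asks the API to omit edit links via disableeditsection. Stripping the returned HTML with the standard-library HTMLParser is a workaround of my own choosing for the plain-text problem, so expect some leftover noise.

```python
import json
import urllib.parse
import urllib.request
from html.parser import HTMLParser

API = "https://en.wikipedia.org/w/api.php"


def api_get(**params):
    """Call the MediaWiki API and return the decoded JSON response."""
    params["format"] = "json"
    url = API + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent; replace it with something that identifies you.
    req = urllib.request.Request(url, headers={"User-Agent": "plot-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def find_plot_index(sections):
    """Return the 'index' of the first section whose heading contains 'plot'."""
    for section in sections:
        if "plot" in section.get("line", "").lower():
            return section.get("index")
    return None


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping styles, scripts and footnote markers."""

    SKIP = {"style", "script", "sup"}
    BLOCK = {"p", "li", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # keep paragraph boundaries

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Reduce an HTML fragment to plain text, one paragraph per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def fetch_plot(title):
    """Return the plot section of an article as plain text, or None if absent."""
    sections = api_get(action="parse", page=title, prop="sections")["parse"]["sections"]
    index = find_plot_index(sections)
    if index is None:
        return None
    data = api_get(action="parse", page=title, prop="text",
                   section=index, disableeditsection=1)
    return html_to_text(data["parse"]["text"]["*"])
```

Calling fetch_plot("The Hobbit") for each title in your list should then give you one plain-text plot per article, or None where no matching section exists. For ~100 articles that is about 200 requests, so a short pause between calls is polite.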