What exactly I want to do is:
Input: a Wikipedia XML dump
Output: a list of triples like this:
<http://dbpedia.org/resource/Lists_of_computer_languages> <http://dbpedia.org/ontology/wikiListOf> <http://dbpedia.org/resource/C_(programming_language)> .
<http://dbpedia.org/resource/Lists_of_computer_languages> <http://dbpedia.org/ontology/wikiListOf> <http://dbpedia.org/resource/Java_(programming_language)> .
...
<http://dbpedia.org/resource/List_of_XML_markup_languages> <http://dbpedia.org/ontology/wikiListOf> <http://dbpedia.org/resource/AdsML> .
<http://dbpedia.org/resource/List_of_XML_markup_languages> <http://dbpedia.org/ontology/wikiListOf> <http://dbpedia.org/resource/Agricultural_Ontology_Service> .
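To make the target concrete, here is a minimal sketch of how such triples could be produced straight from the dump, assuming list pages can be recognised by their title prefix ("List of" / "Lists of") and that the predicate is the dbo:wikiListOf URI from the example above (not a standard DBpedia ontology property). The helper names (extract_list_triples, to_resource) are mine, and the URI encoding is simplified compared to what DBpedia actually does:

    # Sketch: scan a pages-articles XML dump for "List of ..." pages and
    # emit <list page> <wikiListOf> <linked article> triples in N-Triples form.
    import re
    import sys
    import urllib.parse
    import xml.etree.ElementTree as ET

    LINK_RE = re.compile(r"\[\[([^\]|#]+)")          # target part of a [[wikilink]]
    PREDICATE = "<http://dbpedia.org/ontology/wikiListOf>"

    def to_resource(title):
        """Map a Wikipedia article title to a DBpedia resource URI (simplified)."""
        title = title.strip().replace(" ", "_")
        return "<http://dbpedia.org/resource/%s>" % urllib.parse.quote(title, safe="_()',.-")

    def extract_list_triples(dump_path, out=sys.stdout):
        title = ""
        for event, elem in ET.iterparse(dump_path, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]        # drop the MediaWiki XML namespace
            if tag == "title":
                title = elem.text or ""
            elif tag == "text" and title.startswith(("List of", "Lists of")):
                subject = to_resource(title)
                for target in LINK_RE.findall(elem.text or ""):
                    if ":" not in target:            # skip File:, Category:, ... links
                        out.write("%s %s %s .\n" % (subject, PREDICATE, to_resource(target)))
            elif tag == "page":
                elem.clear()                         # keep memory bounded on a full dump

    if __name__ == "__main__":
        extract_list_triples(sys.argv[1])

This is roughly what a dedicated extractor would have to do anyway; the open question is how to tell real list members apart from the other links on a list page.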
We have already set up and customised the DBpedia extraction framework, but I think it would be difficult to configure it to extract this data. I was shocked that the extraction framework does not have any extractor for this!
All the framework extractors look for specific patterns in an article's name or body. If you can identify something in the list pages that does not exist in any other article, then you will be able to create one...
Otherwise you can use the pagelinks (links from page to page) and filter for the articles you want; a sketch of that filtering follows below. This will probably give you what you want (sort of).
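A minimal sketch of that pagelinks route, assuming you work from the DBpedia page-links N-Triples dataset (whose predicate is, as far as I recall, dbo:wikiPageWikiLink) and simply keep the triples whose subject is a "List of"/"Lists of" page, rewriting the predicate to the one wanted above. The file names and the output predicate are assumptions to adjust to your setup:

    # Sketch: filter the DBpedia page_links dump for list-page subjects and
    # rewrite the predicate to the wikiListOf URI used in the question.
    import bz2

    PAGELINK = "<http://dbpedia.org/ontology/wikiPageWikiLink>"
    LIST_OF = "<http://dbpedia.org/resource/List_of_"
    LISTS_OF = "<http://dbpedia.org/resource/Lists_of_"
    OUT_PRED = "<http://dbpedia.org/ontology/wikiListOf>"

    def filter_list_links(pagelinks_path, out_path):
        with bz2.open(pagelinks_path, "rt", encoding="utf-8") as src, \
             open(out_path, "w", encoding="utf-8") as dst:
            for line in src:
                parts = line.split(" ", 2)           # subject, predicate, rest of triple
                if len(parts) < 3 or parts[1] != PAGELINK:
                    continue
                if parts[0].startswith(LIST_OF) or parts[0].startswith(LISTS_OF):
                    # parts[2] still carries the object plus the trailing " ."
                    dst.write("%s %s %s" % (parts[0], OUT_PRED, parts[2]))

    if __name__ == "__main__":
        filter_list_links("page_links_en.nt.bz2", "list_of_triples.nt")

The "sort of" caveat is that this keeps every outgoing link from a list page (navigation templates, "See also" links, and so on), not only the actual list members, so some post-filtering would still be needed.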