
Generating different datasets from the live DBpedia dump


I was playing around with the different datasets provided at the DBpedia download page and found that they are somewhat outdated.

Then I downloaded the latest dump from the DBpedia live site. When I extracted the June 30th file, I got just one huge 37 GB .nt file.

I want to get different datasets (like the separate .nt files available at the download page) from the latest dump. Is there a script or process for doing this?


Solution

  • Solution 1:

    You can use the DBpedia extraction framework: https://github.com/dbpedia/extraction-framework. You need to configure the proper extractors (e.g. the infobox properties extractor, the abstract extractor, etc.). It will download the latest Wikipedia dumps and generate the DBpedia datasets.

    You may need to make some code changes to extract only the required data. One of my colleagues did this for the German datasets. You will still need a lot of disk space. A rough sketch of the configure-and-run steps follows after this list.

  • Solution 2 (I am not sure whether this is really feasible):

    Do a grep over the dump for the required properties. You need to know the exact URIs of the properties you want to extract.

    For example, to get all the home pages:

        bzgrep 'http://xmlns.com/foaf/0.1/homepage' dbpedia_2013_03_04.nt.bz2 > homepages.nt

    This gives you all the N-Triples that use the homepage property, which you can then load into an RDF store. A loop that generalizes this to several properties is sketched below.
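For Solution 1, here is a rough sketch of the configure-and-run steps. The launcher arguments, property keys, and extractor class names are assumptions based on the framework's README at the time of writing and may differ in your version:

    # Clone and build the framework (requires Java and Maven)
    git clone https://github.com/dbpedia/extraction-framework.git
    cd extraction-framework
    mvn clean install

    # Edit dump/extraction.default.properties and keep only the extractors
    # you need, for example (class names are assumptions, check the repo):
    #   extractors=.InfoboxExtractor,.AbstractExtractor

    # Download the Wikipedia dumps and run the extraction from the dump module
    # (the 'run' helper script and its arguments vary between versions)
    cd dump
    ../run download config=download.minimal.properties
    ../run extraction extraction.default.properties

Each configured extractor should write its own output file, which corresponds to the separate dataset files on the download page.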
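For Solution 2, the single-property grep generalizes to a small shell loop. The dump file name and the property list below are illustrative:

    # Split one compressed dump into per-property .nt files
    DUMP=dbpedia_2013_03_04.nt.bz2
    for uri in 'http://xmlns.com/foaf/0.1/homepage' \
               'http://dbpedia.org/ontology/abstract' \
               'http://www.w3.org/2000/01/rdf-schema#label'; do
        name=${uri##*[/#]}   # last URI segment, e.g. "homepage"
        # -F matches the URI as a fixed string; the angle brackets anchor it
        # to the way N-Triples serializes URIs
        bzgrep -F "<$uri>" "$DUMP" > "$name.nt"
    done

Note that this matches the URI anywhere in a triple, not only in the predicate position, which is usually acceptable for property URIs. The resulting files can then be loaded into an RDF store, e.g. with Apache Jena TDB: tdbloader --loc=/data/tdb homepages.nt (the target directory is illustrative).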