rdf semantic-web freebase dbpedia freebase-acre

Freebase RDF dump parsing for name-type extraction?


I have parsed the Freebase data dump and now have RDF like the following:

<http://rdf.freebase.com/ns/m.0mspb64> <http://rdf.freebase.com/ns/type.object.type> <http://rdf.freebase.com/ns/music.release_track>
<http://rdf.freebase.com/ns/m.0mspb64> <http://rdf.freebase.com/ns/type.object.name> "Mit Rees und Hans im Bürgli"@de
<http://rdf.freebase.com/ns/m.0mspd6m> <http://rdf.freebase.com/ns/type.object.type> <http://rdf.freebase.com/ns/music.release_track>
<http://rdf.freebase.com/ns/m.0mspd6m> <http://rdf.freebase.com/ns/type.object.name> "Granny Scratch Scratch"@en

Given this RDF dataset, how can I extract the name and type of a particular resource? For instance, from the data above, I want to extract:

Mit Rees und Hans im Bürgli ### music.release_track
Granny Scratch Scratch ### music.release_track 

Solution

  • What did you use to parse it? The format that you're showing is the raw data format.

    If you've loaded it into an RDF store, you should be able to get the information you need with a SPARQL query, or with whatever other query interface the store offers.

    If you're just working with the raw text file, you can take advantage of the fact that it's sorted by subject ID (you should verify that this is still true) and process it as a stream, without needing much working storage (i.e. RAM).

    The only temporary storage that you need is 1) the current subject ID, 2) the name of the current subject and 3) the type of the current subject. If the type isn't the one you want (release_track), you can just skip to the next group of subject triples. If it is the right type, you can output a line for that subject as soon as you have both the name and the type.
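
The streaming approach described above could be sketched in Python roughly as follows. This is a minimal illustration, not a tested Freebase tool: the function names (parse_triples, extract_names_and_types) and the regular expressions are made up for this example, and it assumes that all triples for a subject sit on consecutive lines. It collects the names and types of one subject group and writes the output when the group ends, a slight simplification of emitting a line as soon as both are known. Escape sequences inside literals are not decoded.

```python
import re
from itertools import groupby

NS = "http://rdf.freebase.com/ns/"
TYPE_PRED = NS + "type.object.type"
NAME_PRED = NS + "type.object.name"

# Simplified patterns (an assumption, not a full N-Triples grammar):
# <subject> <predicate> object, with an optional trailing " ."
TRIPLE_RE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.?\s*$')
# A quoted literal with an optional language tag or datatype
LITERAL_RE = re.compile(r'^"((?:[^"\\]|\\.)*)"(?:@[A-Za-z0-9-]+|\^\^<[^>]+>)?$')


def parse_triples(lines):
    """Yield (subject, predicate, object) for each line that looks like a triple."""
    for line in lines:
        match = TRIPLE_RE.match(line.strip())
        if match:
            yield match.groups()


def extract_names_and_types(lines, wanted_type=None):
    """Yield 'name ### type' strings, holding only one subject group in memory."""
    # groupby only merges *consecutive* triples with the same subject,
    # so this relies on the dump being grouped by subject ID.
    for _subject, group in groupby(parse_triples(lines), key=lambda t: t[0]):
        names, types = [], []
        for _, predicate, obj in group:
            if predicate == TYPE_PRED and obj.startswith("<" + NS):
                # Strip the leading "<" + namespace and the trailing ">"
                types.append(obj[len(NS) + 1:-1])
            elif predicate == NAME_PRED:
                literal = LITERAL_RE.match(obj)
                if literal:
                    names.append(literal.group(1))
        for name in names:
            for type_id in types:
                if wanted_type is None or type_id == wanted_type:
                    yield f"{name} ### {type_id}"


sample = [
    '<http://rdf.freebase.com/ns/m.0mspb64> <http://rdf.freebase.com/ns/type.object.type> <http://rdf.freebase.com/ns/music.release_track>',
    '<http://rdf.freebase.com/ns/m.0mspb64> <http://rdf.freebase.com/ns/type.object.name> "Mit Rees und Hans im Bürgli"@de',
    '<http://rdf.freebase.com/ns/m.0mspd6m> <http://rdf.freebase.com/ns/type.object.type> <http://rdf.freebase.com/ns/music.release_track>',
    '<http://rdf.freebase.com/ns/m.0mspd6m> <http://rdf.freebase.com/ns/type.object.name> "Granny Scratch Scratch"@en',
]

for row in extract_names_and_types(sample, wanted_type="music.release_track"):
    print(row)
```

For the real dump you would pass an open file object instead of the sample list, since the functions only ever look at one line and one subject group at a time. A subject with several names or several types produces one output line per name/type pair; filter on the language tag as well if you only want, say, English names.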