
Choosing namespace prefixes for WordNet data in RDF


I have a text file whose lines I want to convert to N3 format so that I can eventually turn them into RDF. Each line of the text file has an entry like this:

09827177 18 n 03 aristocrat 0 blue_blood 0 patrician 0 013 @ 09646208 n 0000 #m 08404938 n 0000 + 01594891 a 0306 + 01594891 a 0102 ~ 09860027 n 0000 ~ 09892248 n 0000 ~ 10103592 n 0000 ~ 10194721 n 0000 ~ 10304832 n 0000 ~ 10492384 n 0000 ~ 10493649 n 0000 ~ 10525325 n 0000 ~ 10526235 n 0000 | a member of the aristocracy

I am trying to make triples out of the line above so that they look like the table below (the subject is the synset_offset).

  Subject        Predicate         Object

  09827177       lex_filenum       18
  09827177       ss_type           n
  09827177       lexical_entry     aristocrat
  09827177       lexical_entry     blue_blood
  09827177       lexical_entry     patrician
  09827177       has_pointer       09646208
  09646208       ss_type           n
  09646208       source_target     0000
  09827177       description       a member of the aristocracy

I have been able to read most of the variables from each line of the text using this:

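with open("wordnetSample.txt", "r") as f:
    for line in f:
        # Everything before '|' is space-separated data; the gloss follows it.
        data, _, gloss = line.partition('|')
        L = data.split()
        synset_offset = L[0]
        lex_filenum = L[1]
        ss_type = L[2]
        # L[3] is the word count, a two-digit hexadecimal number in WordNet
        # data files; the words alternate with their lex_id fields.
        word = L[4:4 + 2 * int(L[3], 16):2]
        gloss = gloss.strip()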

The problem I am having is that I don't know which namespaces to use. I am new to this style of formatting and to Python in general. I have been researching and feel it should be something like this:

'<http://example.org/#' + synset_offset + '> <http://xmlns.com/foaf/0.1/lex_filenum> ' + lex_filenum + ' .'

I have also been told that Turtle notation may be a better option, but I just can't get my head around it.


Solution

  • In RDF, resources and properties are identified by IRIs, and how you choose those IRIs is really up to you. If you own a domain name, you might base your IRIs on that. If you are pulling data from somewhere else and it makes sense to use names based on that source, you might do that instead. If some of the resources or properties are already identified by IRIs somewhere, it's always good to reuse those, though they are not always easy to find.
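
    As a rough illustration of minting your own IRIs, here is a minimal sketch. The example.org namespaces and the helper names (synset_iri, property_iri, ntriple) are made up for this example; substitute a domain you control. The synset type is included in the IRI on the assumption that offsets are only unique within one part of speech.

```python
# Hypothetical namespaces; replace example.org with a domain you control.
SYNSET_NS = "http://example.org/wordnet/synset/"
SCHEMA_NS = "http://example.org/wordnet/schema#"


def synset_iri(synset_offset, ss_type):
    """Mint an IRI for a synset from its type (e.g. 'n') and offset."""
    return f"{SYNSET_NS}{ss_type}{synset_offset}"


def property_iri(name):
    """Mint an IRI for a property such as 'lex_filenum'."""
    return f"{SCHEMA_NS}{name}"


def ntriple(subject, predicate, literal):
    """Format one N-Triples line whose object is a plain string literal."""
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


line = ntriple(synset_iri("09827177", "n"), property_iri("lex_filenum"), "18")
print(line)
```

    Each line produced this way is a complete N-Triples statement, which is also valid Turtle and N3, so you can start with this and introduce prefixes later.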

    In your case, where the data comes from WordNet, you should look at the W3C Working Draft, RDF/OWL Representation of WordNet. I don't know whether its approaches and namespaces have been widely adopted, but you can certainly learn from the approach. For instance, it describes its URI scheme as follows:

    Each instance of Synset, WordSense and Word has its own URI. There is a pattern for the URIs so that (a) it is easy to determine from the URI the class to which the instance belongs; and (b) the URI provides some information on the meaning of the entity it represents. For example, the following URI

    http://www.w3.org/2006/03/wn/wn20/instances/synset-bank-noun-2

    is a NounSynset. This NounSynset contains a WordSense which is the first sense of the word "bank". The pattern for instances of Synset is: wn20instances: + synset- + %lexform%- + %type%- + %sensenr%. The %lexform% is the lexical form of the first WordSense of the Synset (the first WordSense in the Princeton source as signified by its "wordnumber", see Overview of the WordNet Prolog distribution). The %type% is one of noun, verb, adjective, adjective satellite and adverb. The %sensenr% is the number of the WordSense that is contained in the synset. This pattern produces a unique URI because the WordSense uniquely identifies the synset (a WordSense belongs to exactly one Synset).
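
    The quoted pattern could be sketched in Python roughly as follows. The function name synset_uri and the TYPE_NAMES table are my own; in particular, the spelling used for the adjective satellite type is an assumption that should be checked against the draft. Note that the sense number is not part of the data line shown in the question, so it has to come from elsewhere and is simply passed in here.

```python
WN20_INSTANCES = "http://www.w3.org/2006/03/wn/wn20/instances/"

# Maps the ss_type codes used in WordNet data files to the type names in the
# URI pattern. The spelling "adjectivesatellite" is an assumption.
TYPE_NAMES = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjectivesatellite",
    "r": "adverb",
}


def synset_uri(lexform, ss_type, sense_nr):
    """Build a synset URI as: wn20instances: + synset- + lexform- + type- + sensenr."""
    return f"{WN20_INSTANCES}synset-{lexform}-{TYPE_NAMES[ss_type]}-{sense_nr}"


uri = synset_uri("bank", "n", 2)
print(uri)
```

    For the inputs above this reproduces the example URI quoted from the draft.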

    The draft also defines many properties in its WordNet schema. You should probably reuse those IRIs where possible.
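
    Since you mentioned Turtle, here is a small sketch of what reusing those namespaces with prefixes might look like. Treat the details as assumptions to verify against the draft: the schema namespace IRI and the property name gloss are what I believe the draft uses, the sense number 1 in the local name is a placeholder, and rdfs:label is used for the words only to keep the example short. The helper names are made up.

```python
# Namespace IRIs are assumptions to verify against the W3C draft,
# except rdfs, which is the standard RDF Schema namespace.
PREFIXES = {
    "wn20instances": "http://www.w3.org/2006/03/wn/wn20/instances/",
    "wn20schema": "http://www.w3.org/2006/03/wn/wn20/schema/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def synset_to_turtle(local_name, words, gloss):
    """Return a small Turtle document describing one synset."""
    lines = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    lines.append("")
    lines.append(f"wn20instances:{local_name}")
    for word in words:
        lines.append(f"    rdfs:label {turtle_literal(word)} ;")
    # "gloss" is assumed to be a property in the draft's schema.
    lines.append(f"    wn20schema:gloss {turtle_literal(gloss)} .")
    return "\n".join(lines)


doc = synset_to_turtle(
    "synset-aristocrat-noun-1",  # placeholder sense number
    ["aristocrat", "blue_blood", "patrician"],
    "a member of the aristocracy",
)
print(doc)
```

    The prefixes are declared once at the top, and the semicolons let one subject carry several predicate/object pairs, which is most of what makes Turtle easier to read than writing out full IRIs on every line.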