I have a text file that I want to convert into N3 format so I can eventually turn it into RDF. Each line of the file has an entry like this:
09827177 18 n 03 aristocrat 0 blue_blood 0 patrician 0 013 @ 09646208 n 0000 #m 08404938 n 0000 + 01594891 a 0306 + 01594891 a 0102 ~ 09860027 n 0000 ~ 09892248 n 0000 ~ 10103592 n 0000 ~ 10194721 n 0000 ~ 10304832 n 0000 ~ 10492384 n 0000 ~ 10493649 n 0000 ~ 10525325 n 0000 ~ 10526235 n 0000 | a member of the aristocracy
I am trying to make triples out of the above statement so they will look like the table below.
Subject           Predicate       Object
(synset_offset)
09807754          lex_filenum     18
09807754          ss_type         n
09807754          lexical_entry   aristocrat
09807754          lexical_entry   blue_blood
09807754          lexical_entry   patrician
09807754          has_pointer     09623038
09623038          ss_type         n
09623038          source_target   0000
09807754          description     a member of aristocracy
I have been able to read most of the variables from each line of the text using this:
f = open("wordnetSample.txt", "r")
for line in f:
    L = line.split()
    L2 = line.split('|')
    synset_offset = L[0]
    lex_filenum = L[1]
    ss_type = L[2]
    # L[3] is the word count, a two-digit hexadecimal number;
    # the words alternate with their lex_ids, so step by 2
    word = L[4:4 + 2 * int(L[3], 16):2]
    gloss = L2[1].strip()
f.close()
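For completeness, the same index arithmetic can be extended to pull out the pointer tuples that the table above needs (a sketch, assuming the standard wndb data-file layout, where the word count is hexadecimal and the pointer count is decimal):

```python
# A sketch: extracting the pointer tuples from one data line,
# assuming the standard wndb data-file layout
line = ("09827177 18 n 03 aristocrat 0 blue_blood 0 patrician 0 "
        "013 @ 09646208 n 0000 #m 08404938 n 0000 "
        "+ 01594891 a 0306 + 01594891 a 0102 ~ 09860027 n 0000 "
        "~ 09892248 n 0000 ~ 10103592 n 0000 ~ 10194721 n 0000 "
        "~ 10304832 n 0000 ~ 10492384 n 0000 ~ 10493649 n 0000 "
        "~ 10525325 n 0000 ~ 10526235 n 0000 "
        "| a member of the aristocracy")

L = line.split("|")[0].split()
w_cnt = int(L[3], 16)          # word count is hexadecimal
i = 4 + 2 * w_cnt              # index of p_cnt, the pointer count
p_cnt = int(L[i])              # pointer count is decimal
# each pointer is 4 fields: pointer_symbol, synset_offset, pos, source/target
pointers = [tuple(L[i + 1 + 4 * n : i + 5 + 4 * n]) for n in range(p_cnt)]
```

Each tuple in `pointers` then supplies the `has_pointer`, `ss_type`, and `source_target` triples for one pointer.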
The problem I am having is that I don't know what namespaces to use, or anything like that. I am new to this style of formatting and to Python in general. I have been researching and feel it should be something like this:
'''<http://example.org/#'''+synset_offset+'''> <http://xmlns.com/foaf/0.1/lex_filenum> '''+lex_filenum+''' .
I have also been told that Turtle notation may be a better option, but I just can't get my head around it.
In RDF, resources and properties are identified by IRIs. How you select resource and property IRIs is really up to you. If you own a domain name, you might choose to mint IRIs based on that. If you are pulling data from someplace else, and it makes sense to use names based on that source, you might do that instead. If some of the resources or properties are already identified somewhere by IRIs, it's always good to reuse those, though they are not always easy to find.
In your case, where the data is coming from WordNet, you should probably be very interested in the W3C Working Draft, RDF/OWL Representation of WordNet. I don't know whether the approaches and namespaces therein have been widely adopted, but the approach is surely something you can learn from. For instance:
Each instance of Synset, WordSense and Word has its own URI. There is a pattern for the URIs so that (a) it is easy to determine from the URI the class to which the instance belongs; and (b) the URI provides some information on the meaning of the entity it represents. For example, the following URI
http://www.w3.org/2006/03/wn/wn20/instances/synset-bank-noun-2
is a NounSynset. This NounSynset contains a WordSense which is the first sense of the word "bank". The pattern for instances of Synset is: wn20instances: + synset- + %lexform%- + %type%- + %sensenr%. The %lexform% is the lexical form of the first WordSense of the Synset (the first WordSense in the Princeton source as signified by its "wordnumber", see Overview of the WordNet Prolog distribution). The %type% is one of noun, verb, adjective, adjective satellite and adverb. The %sensenr% is the number of the WordSense that is contained in the synset. This pattern produces a unique URI because the WordSense uniquely identifies the synset (a WordSense belongs to exactly one Synset).
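As a concrete illustration of that pattern, a small helper could assemble such instance URIs. This is only a sketch: the mapping from ss_type codes to type names, and in particular "adjectivesatellite" as the one-word URI form of "adjective satellite", is an assumption inferred from the bank example above.

```python
WN20_INSTANCES = "http://www.w3.org/2006/03/wn/wn20/instances/"

# Map WordNet ss_type codes to the %type% names used in the URI
# pattern; "adjectivesatellite" is an assumed URI spelling of the
# two-word type name
TYPE_NAMES = {"n": "noun", "v": "verb", "a": "adjective",
              "s": "adjectivesatellite", "r": "adverb"}

def synset_uri(lexform, ss_type, sense_nr):
    """Build a wn20instances synset URI per the draft's pattern.

    lexform is the lexical form of the synset's first WordSense
    (multi-word forms keep their underscores); sense_nr is the
    sense number of that WordSense.
    """
    return f"{WN20_INSTANCES}synset-{lexform}-{TYPE_NAMES[ss_type]}-{sense_nr}"
```

Called with the values from the example, `synset_uri("bank", "n", 2)` reproduces the draft's URI for the second noun sense of "bank".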
The draft also defines lots of properties in its WordNet schema. You should probably reuse those IRIs where possible.
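To tie this back to your parsing code, here is a minimal sketch of emitting Turtle by hand for one synset. The ex: namespace is a placeholder and the property names are simply the ones from your table, not an established vocabulary; you would swap in the draft's schema IRIs where they fit.

```python
# A minimal sketch of hand-written Turtle output; ex: and the
# property names are placeholders taken from the question's table
PREFIXES = '@prefix ex: <http://example.org/wn/> .\n\n'

def synset_to_turtle(synset_offset, lex_filenum, ss_type, words, gloss):
    # one subject, several predicate-object pairs separated by ";"
    lines = [f'ex:{synset_offset} ex:lex_filenum "{lex_filenum}" ;']
    lines.append(f'    ex:ss_type "{ss_type}" ;')
    for w in words:
        lines.append(f'    ex:lexical_entry "{w}" ;')
    lines.append(f'    ex:description "{gloss}" .')
    return PREFIXES + "\n".join(lines)

print(synset_to_turtle("09827177", "18", "n",
                       ["aristocrat", "blue_blood", "patrician"],
                       "a member of the aristocracy"))
```

The ";" syntax is what makes Turtle easier to read than N-Triples: you state the subject once and list its predicate-object pairs, ending the block with ".".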