python syntax nlp triples

Extract SVO triples from preprocessed text


I need to extract subject-verb-object triples from a Dutch text. The text was analysed by a Dutch NLP tool named Frog, which tokenized, parsed, tagged and lemmatized it, among other things. Frog produces either FoLiA XML or tab-delimited, column-formatted output with one line per token. Because of some problems with the XML file, I chose to work with the column format. This example represents one sentence:

[image: Frog column output for one sentence]

Now I need to extract the SVO triples per sentence. For that I need the last column, which holds the dependency relations: I need to get the ROOT element and the su and obj1 elements that belong to the ROOT. Unfortunately the example sentence has no obj1. Let's pretend it has.

My idea was to first create a nested list with one list per sentence.

    import csv
    with open('romanfragment_frogged.tsv','r') as f:
         reader = csv.reader(f,delimiter='\t')
         tokens = []
         sentences = []
         list_of_sents = []
         for line in reader:
             tokens.append(line)
             #print(tokens)
             for token in tokens:
                 if token == '1':
                    previous_sentence = list_of_sents
                    sentences.append(previous_sentence)
         list_of_sents = []
         list_of_sents.append(tokens)
         print(list_of_sents)

When I print 'tokens', I get one list with all the tokens. So that is correct, but I'm still trying to create a nested list with 1 list (of tokens) per sentence. Can someone help me with this problem?

(P.S. The second problem is that I'm not sure how to continue once I have a nested list.)


Solution

  • Maybe something like this could work:

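    import csv

    def iter_sentences(fn):
        # newline='' is what the csv module recommends for files it reads.
        with open(fn, 'r', newline='') as f:
            # QUOTE_NONE: a token that is itself a quotation mark must not
            # be treated as the start of a quoted field.
            reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
            sentence = []
            for row in reader:
                if not row:
                    # Ignore blank lines.
                    continue
                if row[0] == '1' and sentence:
                    # A new sentence started.
                    yield sentence
                    sentence = []
                sentence.append(row)
            # Last sentence.
            if sentence:
                yield sentence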
    
    def iter_triples(fn):
        for sentence in iter_sentences(fn):
            # Get all subjects and objects.
            subjects = [tok for tok in sentence if tok[-1] == 'su']
            objects = [tok for tok in sentence if tok[-1] == 'obj1']
            # Now try to map them: find pairs with a head in the same position.
            for obj in objects:
                for subj in subjects:
                    # row[-2] is the position of the head.
                    if subj[-2] == obj[-2]:
                        # Matching subj-obj pair found.
                        # Now get the verb (the head of both subj and obj).
                        # Its position is given in the second-to-last column.
                        position = int(subj[-2])
                        # Subtract 1, as the positions start counting at 1.
                        verb = sentence[position-1]
                        yield subj, verb, obj
    
    for subj, verb, obj in iter_triples('romanfragment_frogged.tsv'):
        # Only print the surface forms.
        print(subj[1], verb[1], obj[1])
    

    Quick explanation: iter_sentences iterates over sentences. Each sentence is a nested list: a list of tokens, where each token is itself a list (containing the token number, surface form, lemma, POS, dependency relation, etc.). The iter_triples function iterates over ‹subject, verb, object› triples. Each element of a triple is a token (i.e. a list, again).
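
    To make the nested-list structure concrete, here is a minimal sketch that runs the same grouping and pairing logic on an in-memory sample instead of a file. The sample rows are invented and use a simplified six-column layout (token number, surface form, lemma, POS, head position, relation); real Frog output has more columns, but the logic only looks at the first, the second and the last two. The names sentences_from_lines and triples_from_sentence are hypothetical helpers for this illustration.

```python
import csv
import io

# Invented sample in a simplified Frog-like layout (NOT real Frog output):
# token number, surface form, lemma, POS, head position, dependency relation.
SAMPLE = (
    "1\tJan\tJan\tSPEC\t2\tsu\n"
    "2\tleest\tlezen\tWW\t0\tROOT\n"
    "3\teen\teen\tLID\t4\tdet\n"
    "4\tboek\tboek\tN\t2\tobj1\n"
    "\n"
    "1\tHij\thij\tVNW\t2\tsu\n"
    "2\tslaapt\tslapen\tWW\t0\tROOT\n"
)


def sentences_from_lines(lines):
    """Group rows into sentences; a token number of '1' starts a new one."""
    sentence = []
    for row in csv.reader(lines, delimiter='\t', quoting=csv.QUOTE_NONE):
        if not row:
            # Blank separator line.
            continue
        if row[0] == '1' and sentence:
            yield sentence
            sentence = []
        sentence.append(row)
    if sentence:
        yield sentence


def triples_from_sentence(sentence):
    """Yield (subj, verb, obj) rows where subject and object share a head."""
    subjects = [tok for tok in sentence if tok[-1] == 'su']
    objects = [tok for tok in sentence if tok[-1] == 'obj1']
    for obj in objects:
        for subj in subjects:
            if subj[-2] == obj[-2]:
                # Head positions are 1-based, list indices are 0-based.
                verb = sentence[int(subj[-2]) - 1]
                yield subj, verb, obj


sentences = list(sentences_from_lines(io.StringIO(SAMPLE)))
triples = [t for s in sentences for t in triples_from_sentence(s)]
for subj, verb, obj in triples:
    print(subj[1], verb[1], obj[1])
```

    With this sample, sentences holds two lists (four rows and two rows), and only the first sentence produces a triple because the second has no obj1.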

    The last three lines of code are just an example of how to use the iter_triples function. I don't know how much and which information you need from each triple...
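
    Note that the code above pairs any su and obj1 that share a head, so it can also return triples from subordinate clauses. If you only want the su and obj1 that belong to the ROOT, as the question asks, something like the following sketch might do. It assumes the main verb carries the relation 'ROOT' in the last column, that dependents refer to their head by its token number, and that the lemma sits in the third column; root_triples is a hypothetical helper, and the sample sentence is invented, not real Frog output.

```python
def root_triples(sentence):
    """Yield (subject, verb, object) lemmas for ROOT tokens only.

    `sentence` is a list of rows as produced by iter_sentences.
    Assumed layout: token number first, lemma at index 2,
    head position second to last, dependency relation last.
    """
    for tok in sentence:
        if tok[-1] != 'ROOT':
            continue
        # Dependents point at their head via its token number (a string).
        deps = [t for t in sentence if t[-2] == tok[0]]
        subjects = [t for t in deps if t[-1] == 'su']
        objects = [t for t in deps if t[-1] == 'obj1']
        for subj in subjects:
            for obj in objects:
                yield subj[2], tok[2], obj[2]


# Invented, simplified rows: number, form, lemma, POS, head, relation.
SENTENCE = [
    ['1', 'Jan', 'Jan', 'SPEC', '2', 'su'],
    ['2', 'koopt', 'kopen', 'WW', '0', 'ROOT'],
    ['3', 'het', 'het', 'LID', '4', 'det'],
    ['4', 'boek', 'boek', 'N', '2', 'obj1'],
    ['5', 'omdat', 'omdat', 'VG', '2', 'mod'],
    ['6', 'Piet', 'Piet', 'SPEC', '8', 'su'],
    ['7', 'koffie', 'koffie', 'N', '8', 'obj1'],
    ['8', 'drinkt', 'drinken', 'WW', '5', 'body'],
]

for subj, verb, obj in root_triples(SENTENCE):
    print(subj, verb, obj)
```

    Here only the main-clause triple is returned; the su and obj1 of "drinkt" are skipped because their head is not the ROOT. Swap the index 2 for 1 if you prefer surface forms over lemmas.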