Distance of Noun from Verb

Is there a way to get the distance of a Noun from the Verb from multiple sentences in a csv file using NLTK and Python?

Example of sentences in a .csv file:

video shows adam stabbing the bystander.
woman quickly ran from the police after the incident.

Output:

1st sentence: 1 (Verb is right after the noun)

2nd sentence: 2 (Verb is after another POS tag)

Solution

Distance between first verb and previous noun

Inspired by the very similar question "Extract nouns and verbs using nltk?".

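import nltk

# nltk.word_tokenize and nltk.pos_tag need their NLTK data packages;
# install them once with nltk.download(...) if a LookupError is raised.

def dist_noun_verb(text):
    # Split the sentence into tokens, then tag each token with its part of speech.
    text = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(text)
    last_noun_pos = None
    for pos, (word, function) in enumerate(pos_tagged):
        if function.startswith('NN'):
            # Noun tags: NN, NNS, NNP, NNPS. Remember the most recent one.
            last_noun_pos = pos
        elif function.startswith('VB'):
            # First verb found: return its distance to the closest preceding noun.
            assert(last_noun_pos is not None)
            return pos - last_noun_pos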

for sentence in ['Video show Adam stabbing the bystander.', 'Woman quickly ran from the police after the incident.']:
    print(sentence)
    d = dist_noun_verb(sentence)
    print('Distance noun-verb: ', d)

Output:

Video show Adam stabbing the bystander.
Distance noun-verb:  1
Woman quickly ran from the police after the incident.
Distance noun-verb:  2
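The question asks about sentences stored in a .csv file, which the loop above does not cover. Here is a minimal sketch of that step, assuming one sentence per row in the first column and no header row. The helper name read_sentences and the filename mentioned in the comment are hypothetical, and an in-memory string stands in for the real file.

```python
import csv
import io

def read_sentences(csv_file, column=0):
    """Yield the stripped text found in `column` of each non-empty row."""
    for row in csv.reader(csv_file):
        if len(row) > column and row[column].strip():
            yield row[column].strip()

# In-memory stand-in for a real file. With a file on disk, use
# open('sentences.csv', newline='') instead (hypothetical filename).
sample = io.StringIO(
    'video shows adam stabbing the bystander.\n'
    'woman quickly ran from the police after the incident.\n'
)
sentences = list(read_sentences(sample))
print(sentences)
```

Each sentence can then be passed to dist_noun_verb. Note that a sentence containing commas must be quoted in the file, otherwise csv.reader splits it into several columns.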

Note that function.startswith('VB') matches the first verb of any kind in the sentence. If you want to distinguish the main verb from other kinds of verbs, you need to examine the specific verb tags returned by nltk.pos_tag: 'VBP', 'VBD', etc.
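As a rough illustration of filtering on specific tags, the sketch below groups the Penn Treebank verb tags (the tagset used by the default English tagger behind nltk.pos_tag) into finite and non-finite forms. The grouping, the constant names and the helper first_verb_index are my own assumptions, and the tags in the example are hand-written rather than produced by running nltk.

```python
# Penn Treebank verb tags, grouped by hand (the grouping is an assumption).
FINITE_VERB_TAGS = {'VBD', 'VBP', 'VBZ'}    # past tense, present, present 3rd person
NONFINITE_VERB_TAGS = {'VB', 'VBG', 'VBN'}  # base form, gerund, past participle

def first_verb_index(pos_tagged, verb_tags):
    """Return the index of the first token whose tag is in verb_tags, or None."""
    for pos, (word, tag) in enumerate(pos_tagged):
        if tag in verb_tags:
            return pos
    return None

# Hand-written (word, tag) pairs, for illustration only.
tagged = [('Adam', 'NNP'), ('stabbing', 'VBG'), ('the', 'DT'),
          ('bystander', 'NN'), ('fled', 'VBD'), ('.', '.')]

print(first_verb_index(tagged, FINITE_VERB_TAGS | NONFINITE_VERB_TAGS))  # 1
print(first_verb_index(tagged, FINITE_VERB_TAGS))                        # 4
```

Restricting the search to finite forms skips the participle 'stabbing', but it is only a heuristic: a finite verb inside a relative clause would still be picked up.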

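Also, the assert(last_noun_pos is not None) line in my code means the function raises an AssertionError if the first verb comes before any noun. You might want to handle that case differently, for example by returning None. Note as well that the function implicitly returns None when the sentence contains no verb at all.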
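One possible way to handle that case is to return None instead of asserting. The variant below is a sketch, not a drop-in replacement: it takes already-tagged (word, tag) pairs, such as nltk.pos_tag returns, so that it can be tried without NLTK data installed. The name dist_noun_verb_tagged is hypothetical and the example tags are hand-written.

```python
def dist_noun_verb_tagged(pos_tagged):
    """Distance from the first verb back to the closest preceding noun.

    Returns None if there is no verb, or if no noun precedes the first verb.
    """
    last_noun_pos = None
    for pos, (word, tag) in enumerate(pos_tagged):
        if tag.startswith('NN'):
            last_noun_pos = pos
        elif tag.startswith('VB'):
            if last_noun_pos is None:
                return None  # verb found before any noun
            return pos - last_noun_pos
    return None  # no verb in the sentence

# Hand-written tags, for illustration only.
print(dist_noun_verb_tagged([('Run', 'VB'), ('home', 'NN'), ('.', '.')]))            # None
print(dist_noun_verb_tagged([('Woman', 'NN'), ('quickly', 'RB'), ('ran', 'VBD')]))  # 2
```

Callers then have to check for None explicitly, which is usually easier to deal with than an AssertionError when processing many sentences in a loop.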

Interestingly, if I add an 's' to 'show' and make the sentence 'Video shows Adam stabbing the bystander.', then nltk tags 'shows' as a noun rather than a verb.

Going further: distance between "main" verb and previous noun

Consider the sentence:

'The umbrella that I used to protect myself from the rain was red.'

This sentence contains three verbs: 'used', 'protect', 'was'. Using nltk.word_tokenize followed by nltk.pos_tag, as I did above, correctly identifies those three verbs:

text = 'The umbrella that I used to protect myself from the rain was red.'
tokens = nltk.word_tokenize(text)
pos_tagged = nltk.pos_tag(tokens)
print(pos_tagged)
# [('The', 'DT'), ('umbrella', 'NN'), ('that', 'IN'), ('I', 'PRP'), ('used', 'VBD'), ('to', 'TO'), ('protect', 'VB'), ('myself', 'PRP'), ('from', 'IN'), ('the', 'DT'), ('rain', 'NN'), ('was', 'VBD'), ('red', 'JJ'), ('.', '.')]
print([(w,f) for w,f in pos_tagged if f.startswith('VB')])
# [('used', 'VBD'), ('protect', 'VB'), ('was', 'VBD')]

However, the main verb of the sentence is 'was'; the other two verbs are part of the nominal group that forms the subject of the sentence, 'The umbrella that I used to protect myself from the rain'.

Thus we might like to write a function dist_subject_verb that returns the distance between the subject and the main verb 'was', rather than between the first verb 'used' and the previous noun.

One way to identify the main verb is to parse the sentence into a tree, and ignore verbs that are located in subtrees, only considering the verb that is a direct child of the root.

The sentence should be parsed as something like:

((The umbrella) (that (I used) to (protect (myself) (from (the rain))))) (was) (red)

And now we can easily ignore 'used' and 'protect', which are deep inside subtrees, and only consider the main verb 'was'.

Parsing the sentence into a tree is a much more complex operation than just tokenizing it.
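To make the idea concrete, here is a sketch that works on a hand-built tree mirroring the bracketing above. It does not parse anything: the tree is written by hand as nested Python lists, with (word, tag) tuples as leaves, and the function names are hypothetical. A real parser would produce a differently shaped tree, so treat this only as an illustration of "ignore verbs inside subtrees".

```python
def is_leaf(node):
    # A leaf is a (word, tag) tuple; a subtree is a list of nodes.
    return isinstance(node, tuple)

def flatten(node):
    """Return all (word, tag) leaves under node, in sentence order."""
    if is_leaf(node):
        return [node]
    leaves = []
    for child in node:
        leaves.extend(flatten(child))
    return leaves

def dist_noun_main_verb(tree):
    """Distance from the top-level verb back to the closest preceding noun."""
    pos = 0
    last_noun_pos = None
    for child in tree:
        # Only a verb that is a direct child of the root counts as the main verb.
        if is_leaf(child) and child[1].startswith('VB'):
            return None if last_noun_pos is None else pos - last_noun_pos
        # Verbs inside subtrees are skipped, but nouns anywhere are remembered.
        for word, tag in flatten(child):
            if tag.startswith('NN'):
                last_noun_pos = pos
            pos += 1
    return None

# Hand-built tree for the umbrella sentence (an assumption, not parser output).
tree = [
    [[('The', 'DT'), ('umbrella', 'NN')],
     [('that', 'IN'), [('I', 'PRP'), ('used', 'VBD')], ('to', 'TO'),
      [('protect', 'VB'), [('myself', 'PRP')],
       [('from', 'IN'), [('the', 'DT'), ('rain', 'NN')]]]]],
    ('was', 'VBD'),
    ('red', 'JJ'),
    ('.', '.'),
]
print(dist_noun_main_verb(tree))  # 1
```

The result is 1 because the closest noun before 'was' is 'rain'. Measuring the distance to the head noun of the subject, 'umbrella', would additionally require identifying the head of the subject subtree, which this sketch does not attempt.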

Here is a similar question that deals with parsing a sentence into a tree:

How to get parse tree using python nltk?