Is there a way to get the distance of a Noun from the Verb from multiple sentences in a csv file using NLTK and Python?
Example of sentences in a .csv file:
video shows adam stabbing the bystander.
woman quickly ran from the police after the incident.
Output:
1st sentence: 1 (Verb is right after the noun)
2nd sentence: 2 (Verb is after another POS tag)
Inspired by the very similar question Extract nouns and verbs using nltk?.
import nltk

def dist_noun_verb(text):
    # Tokenize the sentence and tag each token with its part of speech
    text = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(text)
    last_noun_pos = None
    for pos, (word, function) in enumerate(pos_tagged):
        if function.startswith('NN'):
            # Remember the position of the most recent noun
            last_noun_pos = pos
        elif function.startswith('VB'):
            # First verb found: return its distance from the last noun
            assert(last_noun_pos is not None)
            return pos - last_noun_pos

for sentence in ['Video show Adam stabbing the bystander.', 'Woman quickly ran from the police after the incident.']:
    print(sentence)
    d = dist_noun_verb(sentence)
    print('Distance noun-verb: ', d)
Output:
Video show Adam stabbing the bystander.
Distance noun-verb: 1
Woman quickly ran from the police after the incident.
Distance noun-verb: 2
Note that function.startswith('VB') detects the first verb in the sentence. If you want to distinguish the principal verb from other kinds of verbs, you need to examine the different verb tags assigned by nltk.pos_tag: 'VBP', 'VBD', etc.
Also, the assert(last_noun_pos is not None) line in my code means the code will crash if the first verb comes before any noun. You might want to handle that differently.
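As a rough illustration of both notes, here is a minimal sketch that returns None instead of asserting, and optionally restricts the search to specific verb tags. It works on an already-tagged list of (word, tag) pairs, so it does not call nltk itself. The name dist_noun_verb_tagged and its verb_tags parameter are hypothetical, and returning None is only one possible way to handle a verb with no preceding noun.

```python
def dist_noun_verb_tagged(pos_tagged, verb_tags=None):
    """Distance from the last noun to the first matching verb, or None.

    pos_tagged: list of (word, tag) pairs, e.g. the output of nltk.pos_tag.
    verb_tags:  optional set of exact tags to accept, e.g. {'VBD', 'VBZ'};
                when None, any tag starting with 'VB' counts as a verb.
    """
    def is_verb(tag):
        if verb_tags is None:
            return tag.startswith('VB')
        return tag in verb_tags

    last_noun_pos = None
    for pos, (word, tag) in enumerate(pos_tagged):
        if tag.startswith('NN'):
            last_noun_pos = pos
        elif is_verb(tag):
            if last_noun_pos is None:
                # Verb found before any noun: report "no distance"
                return None
            return pos - last_noun_pos
    # No matching verb in the sentence
    return None


# Hand-written tags, used here only as sample input
tagged = [('The', 'DT'), ('umbrella', 'NN'), ('that', 'IN'), ('I', 'PRP'),
          ('used', 'VBD'), ('to', 'TO'), ('protect', 'VB'), ('myself', 'PRP'),
          ('from', 'IN'), ('the', 'DT'), ('rain', 'NN'), ('was', 'VBD')]

print(dist_noun_verb_tagged(tagged))                     # 3 ('umbrella' -> 'used')
print(dist_noun_verb_tagged(tagged, verb_tags={'VB'}))   # 5 ('umbrella' -> 'protect')
print(dist_noun_verb_tagged([('Run', 'VB'), ('home', 'NN')]))  # None
```

To use it on raw text, you would pass nltk.pos_tag(nltk.word_tokenize(sentence)) as the first argument.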
Interestingly, if I add an 's' to 'show' and make the sentence 'Video shows Adam stabbing the bystander.', then nltk tags 'shows' as a noun rather than a verb.
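Since the question mentions a .csv file, the hard-coded list in the loop above could be replaced by rows read with the standard csv module. This is only a sketch under assumptions: one sentence per row, stored in the first column, with no header row. The function name sentences_from_csv is hypothetical, and the in-memory sample stands in for a real file.

```python
import csv
import io

def sentences_from_csv(file_obj, column=0):
    """Yield the non-empty text found in the given column of each CSV row."""
    for row in csv.reader(file_obj):
        if len(row) > column and row[column].strip():
            yield row[column].strip()


# In-memory stand-in for a real file; with a file on disk you would use
# open(path, newline='') instead of io.StringIO.
sample = io.StringIO(
    "video shows adam stabbing the bystander.\n"
    "woman quickly ran from the police after the incident.\n"
)
sentences = list(sentences_from_csv(sample))
for sentence in sentences:
    print(sentence)
```

Each yielded sentence can then be passed to dist_noun_verb exactly like the strings in the hard-coded list.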
Consider the sentence:
'The umbrella that I used to protect myself from the rain was red.'
This sentence contains three verbs: 'used', 'protect', 'was'. Using nltk.word_tokenize followed by nltk.pos_tag, as I did above, would correctly identify those three verbs:
text = 'The umbrella that I used to protect myself from the rain was red.'
tokens = nltk.word_tokenize(text)
pos_tagged = nltk.pos_tag(tokens)
print(pos_tagged)
# [('The', 'DT'), ('umbrella', 'NN'), ('that', 'IN'), ('I', 'PRP'), ('used', 'VBD'), ('to', 'TO'), ('protect', 'VB'), ('myself', 'PRP'), ('from', 'IN'), ('the', 'DT'), ('rain', 'NN'), ('was', 'VBD'), ('red', 'JJ'), ('.', '.')]
print([(w,f) for w,f in pos_tagged if f.startswith('VB')])
# [('used', 'VBD'), ('protect', 'VB'), ('was', 'VBD')]
However, the main verb of the sentence is 'was'; the other two verbs are part of the nominal group that forms the subject of the sentence, 'The umbrella that I used to protect myself from the rain'.
Thus we might like to write a function dist_subject_verb that returns the distance between the subject and the main verb 'was', rather than between the first verb 'used' and the previous noun.
One way to identify the main verb is to parse the sentence into a tree, and ignore verbs that are located in subtrees, only considering the verb that is a direct child of the root.
The sentence should be parsed as something like:
((The umbrella) (that (I used) to (protect (myself) (from (the rain))))) (was) (red)
And now we can easily ignore 'used' and 'protect', which are deep inside subtrees, and only consider the main verb 'was'.
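To make the idea concrete, here is a minimal sketch that uses a hand-built tree rather than a real parser. The nested-list representation, the helper names (is_leaf, leaves, main_verbs) and the exact tree shape are assumptions made for illustration; 'was' is placed directly under the root to keep the example short, and a real parser would likely produce a different structure.

```python
# Leaves are (word, tag) tuples; subtrees are lists.
tree = [
    [   # subject: the nominal group
        [('The', 'DT'), ('umbrella', 'NN')],
        [('that', 'IN'),
         [('I', 'PRP'), ('used', 'VBD')],
         ('to', 'TO'),
         [('protect', 'VB'),
          [('myself', 'PRP')],
          [('from', 'IN'), [('the', 'DT'), ('rain', 'NN')]]]],
    ],
    ('was', 'VBD'),   # main verb, a direct child of the root
    ('red', 'JJ'),
    ('.', '.'),
]

def is_leaf(node):
    return isinstance(node, tuple)

def leaves(node):
    """Yield every (word, tag) leaf of the tree, left to right."""
    if is_leaf(node):
        yield node
    else:
        for child in node:
            yield from leaves(child)

def main_verbs(root):
    """Keep only verbs that are direct children of the root."""
    return [child for child in root
            if is_leaf(child) and child[1].startswith('VB')]


all_verbs = [word for word, tag in leaves(tree) if tag.startswith('VB')]
print(all_verbs)          # ['used', 'protect', 'was']
print(main_verbs(tree))   # [('was', 'VBD')]
```

The filtering step is short once a tree exists; producing that tree from raw text is the hard part.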
Parsing the sentence into a tree is a much more complex operation than just tokenizing it.
Here is a similar question that deals with parsing a sentence into a tree: