I want to extract phrases that have "of" between two nouns. This is my code:
import nltk
text = "I live in Kingdom of Spain"
tokens = nltk.word_tokenize(text)
tag = nltk.pos_tag(tokens, tagset='universal')
print(tag)
regexes = '''PHRASE:{<NOUN>of<NOUN>}'''
noun_phrase_regex = nltk.RegexpParser(regexes)
result = noun_phrase_regex.parse(tag)
result = list(result)
print(result)
Unfortunately, I do not get a Tree in my result, so my regex does not work.
I have tried:
{<NOUN>(of)<NOUN>}
{<NOUN>{of}<NOUN>}
{<NOUN>of<NOUN>}
{<NOUN><of><NOUN>}
But the result is the same.
Also, once I get the result, how can I extract the Tree values from the list? For now, I'm doing it like this:
result = [element for element in result if type(element) != tuple]
result = [" ".join([word[0] for word in tup_phrase]) for tup_phrase in result]
print(result)
It isn't possible to mix words and POS tags in an nltk parser grammar.
You can still achieve what you want by other means, though. For example, you can chunk on the POS tags that fit your requirement and then check the resulting chunks for those which contain 'of', along with whichever variations of that word you want (e.g. with some capitals). That would work like so:
import nltk
text = "I live in the Kingdom of Spain"
tokens = nltk.word_tokenize(text)
tag = nltk.pos_tag(tokens, tagset='universal')
print(tag)
regexes = 'CHUNK: {<NOUN> <ADP> <NOUN>}'
noun_phrase_regex = nltk.RegexpParser(regexes)
tree = noun_phrase_regex.parse(tag)

# Collect every subtree that the grammar labelled CHUNK.
chunks = []
for subtree in tree.subtrees():
    if subtree.label() == 'CHUNK':
        chunks.append(subtree)

# Keep only the chunks whose middle token is the word 'of'.
found = []
for chunk in chunks:
    leaves = chunk.leaves()
    if leaves[1][0] == 'of':
        found.append(' '.join([word for word, _ in leaves]))

print(found)
This will give you:
>>> print(found)
['Kingdom of Spain']
>>> nltk.__version__
'3.7'
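If you would rather have the grammar itself do the filtering, one possible variation (not part of the approach above, just a sketch) is to retag the word before chunking, so that 'of' carries its own tag and the chunk rule can refer to it. The tag name OF and the helper retag_word below are made up for illustration and are not part of nltk. The tokens are tagged by hand so the snippet runs without downloading any tokenizer or tagger models; in practice you would pass in the output of nltk.pos_tag.

```python
import nltk

# Hand-tagged tokens (universal tagset), standing in for nltk.pos_tag output.
tagged = [
    ("I", "PRON"), ("live", "VERB"), ("in", "ADP"), ("the", "DET"),
    ("Kingdom", "NOUN"), ("of", "ADP"), ("Spain", "NOUN"),
]


def retag_word(tagged_tokens, word, new_tag):
    """Hypothetical helper: give every occurrence of `word` (any casing) the tag `new_tag`."""
    return [
        (tok, new_tag if tok.lower() == word else tag)
        for tok, tag in tagged_tokens
    ]


# "OF" is an invented tag used only so the grammar can single out the word.
retagged = retag_word(tagged, "of", "OF")

parser = nltk.RegexpParser("PHRASE: {<NOUN><OF><NOUN>}")
tree = parser.parse(retagged)

# subtrees(filter=...) yields only the subtrees for which the filter is true.
phrases = [
    " ".join(word for word, _ in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "PHRASE")
]
print(phrases)
```

Because the comparison lowercases each token, 'Of' and 'OF' in the text are retagged as well. The same leaves() and join pattern also covers the second part of the question, since it pulls the words out of each matching Tree without checking element types by hand.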