I have a text. I tokenize it and remove stopwords. Then I tag these words using the Stanford POS tagger in Python. For now, I am using this code to tag the words and write them to a file:
import nltk

# filtered_sentence is the list of tokens left after stopword removal
tag = nltk.pos_tag(filtered_sentence)
print("tagging the words")
# Write one "word TAG" pair per line
with open("Stop_Words.txt", "w") as fh:
    for word, pos in tag:
        fh.write(word + " " + pos + "\n")
Now I get a list something like this in my file:
paper NN
parallel NN
programming VBG
practical JJ
Greg NNP
Wilson NNP
intended VBD
scientist NN
interested JJ
... A big List ...
What I want to do now is apply a regex to this list to find particular cases. For example, I want something like (JJ*N+), which means an adjective followed by any noun. I wrote N+ because NN, NNP, etc. are all nouns.
How should I do this? I am clueless. Any help will be appreciated.
If you only want a single adjective followed by a noun (JJ then N), you could do something like this:
import re
text = '''paper NN
parallel NN
programming VBG
practical JJ
Greg NNP
Wilson NNP
intended VBD
scientist NN
interested JJ
'''
pattern = re.compile(r'\w+? JJ\n\w+ NN.?', re.DOTALL)
result = pattern.findall(text)
print(result)
Output
['practical JJ\nGreg NNP']
Explanation
The pattern '\w+? JJ\n\w+ NN.?' matches a group of word characters (\w+), followed by a space, JJ and a newline (\n), followed by another group of word characters, a space and a tag that starts with NN (.? allows one optional extra character, so NN, NNS and NNP all match). Note that I included both words in the match because I think it might be useful for your purposes.
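If you need the words themselves rather than the raw matched text, named groups may help. This is only a sketch: the helper find_adj_noun_pairs and the group names adj, noun and tag are made up for illustration, and it assumes one "word TAG" pair per line with tokens made of \w characters only (no hyphens or apostrophes).

```python
import re

# One adjective line (tag exactly JJ) directly followed by one noun line
# (any tag starting with NN). re.MULTILINE lets ^ and $ match at line breaks.
PAIR = re.compile(
    r'^(?P<adj>\w+) JJ\n(?P<noun>\w+) (?P<tag>NN\w*)$',
    re.MULTILINE,
)


def find_adj_noun_pairs(text):
    """Return (adjective, noun, noun_tag) tuples found in the tagged text."""
    return [(m.group('adj'), m.group('noun'), m.group('tag'))
            for m in PAIR.finditer(text)]


# Same sample as above, written as one string.
sample = (
    "paper NN\nparallel NN\nprogramming VBG\npractical JJ\n"
    "Greg NNP\nWilson NNP\nintended VBD\nscientist NN\ninterested JJ\n"
)
pairs = find_adj_noun_pairs(sample)
print(pairs)  # [('practical', 'Greg', 'NNP')]
```

Using \w* after NN instead of .? also picks up four-letter tags such as NNPS in full.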
UPDATE
If you want zero or more adjectives (JJ*) followed by one or more nouns (NN+), you could do something like this:
import re
text = '''paper NN
parallel NN
programming VBG
practical JJ
Greg NNP
Wilson NNP
intended VBD
scientist NN
interested JJ
'''
pattern = re.compile(r'(\w+? JJ\n)*(\w+ NN\w?)+', re.DOTALL)
result = pattern.finditer(text)
for element in result:
    print(element.group())
    print('----')
Output
paper NN
----
parallel NN
----
practical JJ
Greg NNP
----
Wilson NNP
----
scientist NN
----
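Note that in the output above consecutive nouns (paper, parallel) come out as separate matches, because the noun group contains no newline to bridge two lines. If you would rather get each run as one chunk, a variant along these lines might work. It is a rough sketch, not a drop-in solution: find_chunks is a hypothetical helper, the tagged list is a hard-coded stand-in for the tagger output, and tokens are again assumed to contain only \w characters.

```python
import re

# Zero or more adjective lines followed by one or more noun lines.
# Every line is "word TAG\n"; JJ\w* also covers JJR/JJS and
# NN\w* covers NNS/NNP/NNPS.
CHUNK = re.compile(r'(?:^\w+ JJ\w*\n)*(?:^\w+ NN\w*\n)+', re.MULTILINE)


def find_chunks(tagged):
    """Group (word, tag) pairs into adjective*-noun+ runs of words."""
    # Rebuild the one-pair-per-line text, ending every line with a newline.
    text = ''.join(f'{word} {tag}\n' for word, tag in tagged)
    return [[line.split()[0] for line in m.group().splitlines()]
            for m in CHUNK.finditer(text)]


# Hard-coded stand-in for the (word, tag) list shown in the question.
tagged = [
    ('paper', 'NN'), ('parallel', 'NN'), ('programming', 'VBG'),
    ('practical', 'JJ'), ('Greg', 'NNP'), ('Wilson', 'NNP'),
    ('intended', 'VBD'), ('scientist', 'NN'), ('interested', 'JJ'),
]
chunks = find_chunks(tagged)
for chunk in chunks:
    print(' '.join(chunk))
# paper parallel
# practical Greg Wilson
# scientist
```

Working from the (word, tag) list directly also avoids the round trip through the file.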