python regex python-3.x nltk pos-tagger

How do I apply a regex to my tagged text in Python 3?


I have a text. I tokenize it and remove stopwords, then I tag the remaining words using the Stanford POS tagger in Python. For now, I am using this code to tag the words and write them to a file.

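import nltk

# filtered_sentence is the list of tokens left after stopword removal
tag = nltk.pos_tag(filtered_sentence)
print("tagging the words")
# Write one "word TAG" pair per line; the with block closes the file
with open("Stop_Words.txt", "w") as fh:
    for word, pos in tag:
        fh.write(word + " " + pos + "\n")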

Now my file contains a list like this:

paper NN
parallel NN
programming VBG
practical JJ
Greg NNP
Wilson NNP
intended VBD
scientist NN
interested JJ
... A big List ...

What I want to do now is apply a regex to this list to find particular cases. For example, I want something like (JJ*N+), which means an adjective followed by any noun. I wrote N+ because NN, NNP, etc. are all nouns.

How should I do this? I am clueless. Any help will be appreciated.


Solution

  • If you only want one adjective (JJ) directly followed by a noun, you could do something like this:

    import re
    
    text = '''paper NN
    parallel NN
    programming VBG
    practical JJ
    Greg NNP
    Wilson NNP
    intended VBD
    scientist NN
    interested JJ
    '''
    
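    # Raw string so \w and \n reach the regex engine unchanged; without
    # re.DOTALL the trailing .? cannot swallow the newline after a plain NN
    pattern = re.compile(r'\w+? JJ\n\w+ NN.?')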
    
    result = pattern.findall(text)
    print(result)
    

    Output

    ['practical JJ\nGreg NNP']
    

    Explanation

    The pattern '\w+? JJ\n\w+ NN.?' matches a run of word characters (\w+), followed by a space, JJ, and a newline (\n), followed by another run of word characters, a space, and a tag that starts with NN. The trailing .? allows at most one extra character, so NN, NNS and NNP match in full, while NNPS is matched only up to NNP. Note that I included both words because I think it might be useful for your purposes.
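
If you only need the two words rather than the whole matched text, a minimal sketch along these lines might help. The names pair_pattern, adjective and noun are illustrative choices of mine, and the sketch assumes every line has the exact form "word TAG". It uses NN\w* instead of NN.? so that longer noun tags such as NNPS are covered in full.

```python
import re

# Sample tagged text, one "word TAG" pair per line (same data as above).
text = """paper NN
parallel NN
programming VBG
practical JJ
Greg NNP
Wilson NNP
intended VBD
scientist NN
interested JJ
"""

# Named groups capture only the words. With re.MULTILINE, ^ and $ anchor
# the match to whole lines. NN\w* accepts NN, NNS, NNP and NNPS.
pair_pattern = re.compile(
    r"^(?P<adjective>\w+) JJ\n(?P<noun>\w+) NN\w*$",
    re.MULTILINE,
)

# Collect (adjective, noun) tuples for every adjective line that is
# immediately followed by a noun line.
pairs = [(m.group("adjective"), m.group("noun")) for m in pair_pattern.finditer(text)]
print(pairs)
```

On the sample data this should print [('practical', 'Greg')], the same pair as before but without the tags.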

    UPDATE

    If you want zero or more adjectives (JJ*) followed by one or more nouns (NN+), you could do something like this:

    import re
    
    text = '''paper NN
    parallel NN
    programming VBG
    practical JJ
    Greg NNP
    Wilson NNP
    intended VBD
    scientist NN
    interested JJ
    '''
    
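    # Raw string so \w and \n reach the regex engine unchanged; re.DOTALL
    # is not needed because the pattern contains no dot
    pattern = re.compile(r'(\w+? JJ\n)*(\w+ NN\w?)+')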
    
    result = pattern.finditer(text)
    for element in result:
        print(element.group())
        print('----')
    

    Output

    paper NN
    ----
    parallel NN
    ----
    practical JJ
    Greg NNP
    ----
    Wilson NNP
    ----
    scientist NN
    ----
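
In the output above, consecutive nouns such as Greg and Wilson end up in separate matches, because the repeated noun group contains no newline and so cannot continue onto the next line. If you would rather get each run of adjectives and nouns as one chunk, a possible variation is sketched below. The names chunk_pattern and chunks are illustrative, and the sketch assumes that every line, including the last one, ends with a newline and that words consist only of word characters.

```python
import re

# Sample tagged text, one "word TAG" pair per line (same data as above).
text = """paper NN
parallel NN
programming VBG
practical JJ
Greg NNP
Wilson NNP
intended VBD
scientist NN
interested JJ
"""

# Each repeated group consumes a whole line including its newline, so the
# repetition can span several lines. (?:...) groups without capturing.
# JJ\w? accepts JJ, JJR and JJS; NN\w* accepts NN, NNS, NNP and NNPS.
chunk_pattern = re.compile(
    r"(?:^\w+ JJ\w?\n)*(?:^\w+ NN\w*\n)+",
    re.MULTILINE,
)

chunks = []
for match in chunk_pattern.finditer(text):
    # Keep only the word from each "word TAG" line of the match.
    words = [line.split()[0] for line in match.group().splitlines()]
    chunks.append(words)

print(chunks)
```

On the sample data this should give [['paper', 'parallel'], ['practical', 'Greg', 'Wilson'], ['scientist']]. A trailing adjective with no noun after it, like interested, is not reported.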