python, nlp, nltk, pos-tagger

Detecting POS tag patterns along with specified words


I need to identify certain POS tags before or after certain specified words. For example, the following tagged sentence:

[('This', 'DT'), ('feature', 'NN'), ('would', 'MD'), ('be', 'VB'), ('nice', 'JJ'), ('to', 'TO'), ('have', 'VB')]

can be abstracted to the form "would be" + Adjective

Similarly:

[('I', 'PRP'), ('am', 'VBP'), ('able', 'JJ'), ('to', 'TO'), ('delete', 'VB'), ('the', 'DT'), ('group', 'NN'), ('functionality', 'NN')]

is of the form "am able to" + Verb

How can I check for these types of patterns in sentences? I am using NLTK.


Solution

  • Assuming you want to check literally for "would" followed by "be", followed by some adjective, you can do this:

    def would_be(tagged):
        # Slide a 3-token window: two literal words, then an adjective tag.
        return any(['would', 'be', 'JJ'] == [tagged[i][0], tagged[i+1][0], tagged[i+2][1]] for i in range(len(tagged) - 2))
    

    The input is a POS-tagged sentence: a list of (word, tag) tuples, as returned by NLTK's pos_tag.

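    It checks whether the list contains three consecutive elements such that "would" is followed by "be", which is in turn followed by a word tagged as an adjective ('JJ'). Because any() consumes a generator expression, it returns True as soon as this "pattern" is matched. The comparison is exact, so it is case-sensitive and does not match related tags such as 'JJR'.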

    You can do something very similar for the second type of sentence:

    def am_able_to(tagged):
        # Slide a 4-token window: three literal words, then a base-form verb tag.
        return any(['am', 'able', 'to', 'VB'] == [tagged[i][0], tagged[i+1][0], tagged[i+2][0], tagged[i+3][1]] for i in range(len(tagged) - 3))
    

    Here's a driver for the program:

    s1 = [('This', 'DT'), ('feature', 'NN'), ('would', 'MD'), ('be', 'VB'), ('nice', 'JJ'), ('to', 'TO'), ('have', 'VB')]
    s2 = [('I', 'PRP'), ('am', 'VBP'), ('able', 'JJ'), ('to', 'TO'), ('delete', 'VB'), ('the', 'DT'), ('group', 'NN'), ('functionality', 'NN')]
    
    def would_be(tagged):
        return any(['would', 'be', 'JJ'] == [tagged[i][0], tagged[i+1][0], tagged[i+2][1]] for i in range(len(tagged) - 2))
    
    def am_able_to(tagged):
        return any(['am', 'able', 'to', 'VB'] == [tagged[i][0], tagged[i+1][0], tagged[i+2][0], tagged[i+3][1]] for i in range(len(tagged) - 3))
    
    sent1 = ' '.join(s[0] for s in s1)
    sent2 = ' '.join(s[0] for s in s2)
    
    print("Is '{1}' of type 'would be' + adj? {0}".format(would_be(s1), sent1))
    print("Is '{1}' of type 'am able to' + verb? {0}".format(am_able_to(s1), sent1))
    
    print("Is '{1}' of type 'would be' + adj? {0}".format(would_be(s2), sent2))
    print("Is '{1}' of type 'am able to' + verb? {0}".format(am_able_to(s2), sent2))
    

    This correctly outputs:

    Is 'This feature would be nice to have' of type 'would be' + adj? True
    Is 'This feature would be nice to have' of type 'am able to' + verb? False
    Is 'I am able to delete the group functionality' of type 'would be' + adj? False
    Is 'I am able to delete the group functionality' of type 'am able to' + verb? True
    

    If you'd like to generalize this, you can change, position by position, whether you're comparing against the literal word (index 0 of each tuple) or its POS tag (index 1).
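
One way to generalize the two functions is sketched below, under some assumptions: the helper name `matches` and the `("word", ...)` / `("tag", ...)` pattern format are made up for illustration and are not part of NLTK. Each pattern element says which field of the (word, tag) tuple to compare, and the helper slides a window of the pattern's length over the sentence, just as the functions in the answer do.

```python
def matches(tagged, pattern):
    """Return True if `pattern` matches some run of consecutive tokens.

    `tagged` is a list of (word, tag) tuples. `pattern` is a list of
    ("word", value) or ("tag", value) pairs (a hypothetical format
    chosen for this sketch).
    """
    field = {"word": 0, "tag": 1}  # which tuple index each kind compares
    n = len(pattern)
    return any(
        all(tagged[i + j][field[kind]] == value
            for j, (kind, value) in enumerate(pattern))
        for i in range(len(tagged) - n + 1)
    )


WOULD_BE_ADJ = [("word", "would"), ("word", "be"), ("tag", "JJ")]
AM_ABLE_TO_VERB = [("word", "am"), ("word", "able"), ("word", "to"), ("tag", "VB")]

s1 = [('This', 'DT'), ('feature', 'NN'), ('would', 'MD'), ('be', 'VB'),
      ('nice', 'JJ'), ('to', 'TO'), ('have', 'VB')]
s2 = [('I', 'PRP'), ('am', 'VBP'), ('able', 'JJ'), ('to', 'TO'),
      ('delete', 'VB'), ('the', 'DT'), ('group', 'NN'), ('functionality', 'NN')]

print(matches(s1, WOULD_BE_ADJ))     # True
print(matches(s2, AM_ABLE_TO_VERB))  # True
print(matches(s2, WOULD_BE_ADJ))     # False
```

With this shape, a new pattern is just a new list, so no extra function is needed per sentence type. Matching remains exact and case-sensitive, as in the original functions.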