python nltk chunking

Chunking - regular expressions and trees


I'm a total noob, so sorry if I'm asking something obvious. My question is twofold, or rather it's two questions on the same topic:

  1. I'm studying NLTK at uni, and we're doing chunking. In my notes I have the following grammar:
grammar = r"""
            NP: {<DT|PP\$>?<JJ>*<NN.*>+} # noun phrase
            PP: {<IN><NP>}               # prepositional phrase
            VP: {<MD>?<VB.*><NP|PP>}     # verb phrase
            CLAUSE: {<NP><VP>}           # full clause
        """

What is the "$" symbol for in this case? I know it means "end of the line" in regex, but what does it stand for here?

  2. Also, in my textbook there's a Tree that's been printed without using the .draw() function, with this result:
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])

How the heck does one do that???

Thanks in advance to anybody who'll have the patience to school this noob :D


Solution

  • Here is your example as runnable code:

    import nltk
    
    sentence = [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')]
    
    grammar = r"""
                NP: {<DT|PP\$>?<JJ>*<NN.*>+} # noun phrase
                PP: {<IN><NP>}               # prepositional phrase
                VP: {<MD>?<VB.*><NP|PP>}     # verb phrase
                CLAUSE: {<NP><VP>}           # full clause
            """
    
    cp = nltk.RegexpParser(grammar) 
    result = cp.parse(sentence)
    
    print(result)
    
    # output (NLTK wraps it over several lines; shown here on one line):
    # (S (CLAUSE (NP the/DT book/NN) (VP has/VBZ (NP many/JJ chapters/NNS))))
    
    
    result.draw()
    

    For comparison, the tree from your textbook:

    Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
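
On the second question: the `Tree('S', [...])` form looks like the tree's `repr()`, which is what an interactive Python session shows when you evaluate the tree without calling `print()`. The sketch below is one way to reproduce it. It assumes the textbook used an NP-only grammar (the name `np_grammar` is mine, not from the book), since the full grammar above also produces VP and CLAUSE nodes.

```python
from nltk import RegexpParser

sentence = [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'),
            ('many', 'JJ'), ('chapters', 'NNS')]

# Assumption: only the NP rule, so no VP or CLAUSE nodes are built.
np_grammar = r"NP: {<DT|PP\$>?<JJ>*<NN.*>+}"

result = RegexpParser(np_grammar).parse(sentence)

# repr() gives the constructor-style form from the textbook.
print(repr(result))
# Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])

# print() uses str(), which gives the bracketed form instead.
print(result)
```

In a REPL or notebook, typing `result` on its own line has the same effect as `print(repr(result))`.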
    

    I found this link where you can learn a lot.

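    The $ symbol is a special character in regular expressions, so it must be escaped with a backslash to match a literal $. Here, \$ is part of the tag name PP$ (the possessive pronoun tag in the Brown tagset, e.g. "his"), so <DT|PP\$> matches a determiner or a possessive pronoun.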

    Example of an unescaped $:

    xyz$  ->  matches the pattern xyz only at the end of a string
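
To see the difference between the two uses of `$`, here is a small sketch using only the standard `re` module. The tag list and variable names are made up for illustration. NLTK translates tag patterns into regular expressions internally, so the same escaping rule applies there.

```python
import re

anchor = re.compile(r"PP$")    # "$" is the end-of-string anchor
literal = re.compile(r"PP\$")  # "\$" is a literal dollar sign

tags = ["PP", "PP$", "NN"]

# Which tags does each pattern match in full?
anchor_hits = [t for t in tags if anchor.fullmatch(t)]
literal_hits = [t for t in tags if literal.fullmatch(t)]

print(anchor_hits)   # ['PP']
print(literal_hits)  # ['PP$']
```

Without the backslash, the pattern would match the tag PP rather than PP$, which is why the grammar writes `PP\$`.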