Search code examples
parsingnltk

Parsing with nltk of a Recursive Grammar


I must be missing something fundamental about recursively defined nonterminals giving issue, but all I want to do is to recognize something like a regular expression, where a series of numbers followed by a series of letters.

from nltk import CFG
import nltk

grammar = CFG.fromstring("""
S -> N L
N -> N | '1' | '2' | '3'
L -> L | 'A' | 'B' | 'C'
""")

from nltk.parse import BottomUpChartParser
parser = nltk.ChartParser(grammar)

sentence = '1 2 1 3 A C B C'.split()

for t in parser.parse(sentence):
    print(t)

the above code returns an empty parse and nothing is printed


Solution

  • The grammar would need to be something like:

    from nltk import CFG, ChartParser
    
    grammar = CFG.fromstring("""
    S -> N L
    N -> N N | '1' | '2' | '3'
    L -> L L | 'A' | 'B' | 'C'
    """)
    
    parser = ChartParser(grammar)
    sentence = '1 2 1 3 A C B C'.split()
    for t in parser.parse(sentence):
        print(t)
        break
    <script src="https://cdn.jsdelivr.net/gh/pysnippet/pysnippet@latest/snippet.min.js"></script>

    Using N -> N N as an example: the first N could be "eaten up" and transformed into a 1 when parsing the sentence, leaving the next N to go on and produce another N -> N N.

    But this will result in a lot of possible parses, for something more efficient you probably want something like this:

    from nltk import CFG, ChartParser
    
    grammar = CFG.fromstring("""
    S -> N L
    N -> '1' N | '2' N | '3' N | '1' | '2' | '3'
    L -> 'A' L | 'B' L | 'C' L | 'A' | 'B' | 'C'
    """)
    
    parser = ChartParser(grammar)
    sentence = '1 2 1 3 A C B C'.split()
    for t in parser.parse(sentence):
        print(t)
        break
    <script src="https://cdn.jsdelivr.net/gh/pysnippet/pysnippet@latest/snippet.min.js"></script>


    Regular Version. The language from the question: "one or more numbers followed by one or more letters" or (1,2,3)+(A,B,C)+ is a regular language, so we can represent it with a regular grammar:

    from nltk import CFG, ChartParser
    
    grammar = CFG.fromstring("""
    S -> N
    N -> '1' N | '2' N | '3' N | '1' | '2' | '3' | L
    L -> 'A' L | 'B' L | 'C' L | 'A' | 'B' | 'C'
    """)
    
    parser = ChartParser(grammar)
    sentence = '1 2 1 3 A C B C'.split()
    for t in parser.parse(sentence):
        print(t)
        break
    <script src="https://cdn.jsdelivr.net/gh/pysnippet/pysnippet@latest/snippet.min.js"></script>

    Try all three out and see what the parses look like on different inputs!