I must be missing something fundamental about recursively defined nonterminals, because all I want to do is recognize something like a regular expression: a series of numbers followed by a series of letters.
from nltk import CFG
import nltk
grammar = CFG.fromstring("""
S -> N L
N -> N | '1' | '2' | '3'
L -> L | 'A' | 'B' | 'C'
""")
from nltk.parse import BottomUpChartParser
parser = nltk.ChartParser(grammar)
sentence = '1 2 1 3 A C B C'.split()
for t in parser.parse(sentence):
    print(t)
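The above code returns an empty parse, and nothing is printed.
The rule N -> N only rewrites N as itself, so N can still cover just a single token (and likewise for L). The grammar would need to be something like: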
from nltk import CFG, ChartParser
grammar = CFG.fromstring("""
S -> N L
N -> N N | '1' | '2' | '3'
L -> L L | 'A' | 'B' | 'C'
""")
parser = ChartParser(grammar)
sentence = '1 2 1 3 A C B C'.split()
for t in parser.parse(sentence):
    print(t)
    break
Using N -> N N as an example: the first N could be "eaten up" and transformed into a 1 when parsing the sentence, leaving the next N to go on and produce another N -> N N.
But this grammar is ambiguous, so it will result in a lot of possible parses for the same sentence. For something more efficient, you probably want something like this:
from nltk import CFG, ChartParser
grammar = CFG.fromstring("""
S -> N L
N -> '1' N | '2' N | '3' N | '1' | '2' | '3'
L -> 'A' L | 'B' L | 'C' L | 'A' | 'B' | 'C'
""")
parser = ChartParser(grammar)
sentence = '1 2 1 3 A C B C'.split()
for t in parser.parse(sentence):
    print(t)
    break
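To make the difference in the number of parses concrete, here is a minimal sketch that counts every tree each grammar yields for the example sentence. The helper name count_parses is made up for this illustration; it only relies on CFG.fromstring and ChartParser.parse, which the snippets above already use.

```python
from nltk import CFG, ChartParser

# Ambiguous version: N -> N N can bracket a run of digits in several ways.
ambiguous = CFG.fromstring("""
S -> N L
N -> N N | '1' | '2' | '3'
L -> L L | 'A' | 'B' | 'C'
""")

# Right-recursive version: each run has exactly one derivation.
right_recursive = CFG.fromstring("""
S -> N L
N -> '1' N | '2' N | '3' N | '1' | '2' | '3'
L -> 'A' L | 'B' L | 'C' L | 'A' | 'B' | 'C'
""")


def count_parses(grammar, tokens):
    """Hypothetical helper: exhaust the parser and count the trees."""
    return sum(1 for _ in ChartParser(grammar).parse(tokens))


sentence = '1 2 1 3 A C B C'.split()
# Four digits can be bracketed 5 ways, and so can four letters: 5 * 5 = 25.
print(count_parses(ambiguous, sentence))        # 25
print(count_parses(right_recursive, sentence))  # 1
```

The count for the ambiguous grammar grows quickly as the runs get longer, which is why the loop above stops after the first tree.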
Regular version. The language from the question, "one or more numbers followed by one or more letters", or (1,2,3)+(A,B,C)+, is a regular language, so we can represent it with a regular grammar:
from nltk import CFG, ChartParser
grammar = CFG.fromstring("""
S -> N
N -> '1' N | '2' N | '3' N | '1' | '2' | '3' | L
L -> 'A' L | 'B' L | 'C' L | 'A' | 'B' | 'C'
""")
parser = ChartParser(grammar)
sentence = '1 2 1 3 A C B C'.split()
for t in parser.parse(sentence):
    print(t)
    break
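Since the language is regular, you can also check membership with an ordinary regular expression and no parser at all. This is only a sketch: it assumes every token is a single character, so the tokens can be joined without a separator, and the names PATTERN and matches are invented for the example.

```python
import re

# (1,2,3)+(A,B,C)+ written as a Python regular expression.
PATTERN = re.compile(r"[123]+[ABC]+")


def matches(tokens):
    """Hypothetical helper: True if the token list is digits then letters."""
    # Assumes single-character tokens, so joining them loses no information.
    return PATTERN.fullmatch("".join(tokens)) is not None


print(matches('1 2 1 3 A C B C'.split()))  # True
print(matches('1 2'.split()))              # False, no letters
print(matches('A 1'.split()))              # False, wrong order
```

This only answers yes or no. If you need a tree, stay with one of the grammars.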
Try all three out and see what the parses look like on different inputs!
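For instance, a quick way to compare them is to run each grammar over a handful of inputs and print whether it finds any parse. This is a sketch, not a definitive test harness: accepts and the dictionary keys are names made up here. One thing it shows is that the regular version as written also accepts digit-only and letter-only input, because N can stop at a digit or go straight to L. The "strict" entry is my own assumed variant (not from the grammars above) that forces at least one digit and at least one letter.

```python
from nltk import CFG, ChartParser

GRAMMARS = {
    "ambiguous": CFG.fromstring("""
        S -> N L
        N -> N N | '1' | '2' | '3'
        L -> L L | 'A' | 'B' | 'C'
    """),
    "right_recursive": CFG.fromstring("""
        S -> N L
        N -> '1' N | '2' N | '3' N | '1' | '2' | '3'
        L -> 'A' L | 'B' L | 'C' L | 'A' | 'B' | 'C'
    """),
    "regular": CFG.fromstring("""
        S -> N
        N -> '1' N | '2' N | '3' N | '1' | '2' | '3' | L
        L -> 'A' L | 'B' L | 'C' L | 'A' | 'B' | 'C'
    """),
    # Assumed variant for illustration: S must start with a digit and
    # N can only finish by handing over to L, so both runs are non-empty.
    "strict": CFG.fromstring("""
        S -> '1' N | '2' N | '3' N
        N -> '1' N | '2' N | '3' N | L
        L -> 'A' L | 'B' L | 'C' L | 'A' | 'B' | 'C'
    """),
}


def accepts(grammar, sentence):
    """Hypothetical helper: True if the grammar yields at least one parse."""
    parser = ChartParser(grammar)
    return any(True for _ in parser.parse(sentence.split()))


for sentence in ['1 2 1 3 A C B C', '1 A', '1 2', 'A B', 'A 1']:
    results = {name: accepts(g, sentence) for name, g in GRAMMARS.items()}
    print(f"{sentence!r:20} {results}")
```

Only the "regular" entry reports True for '1 2' and 'A B'; the other three reject both.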