I am using this NLTK code to generate sentences from demo_grammar (see below). The problem is that with grammar rules like N N or N N N, I end up with sentences like "creation creation creation". I am only interested in generating sentences where the same word doesn't occur twice (e.g. "creation video software").
How could I do that?
The generate.py from NLTK is here: https://github.com/nltk/nltk/blob/develop/nltk/parse/generate.py
I have tried the demo code from generate.py:
from nltk.grammar import CFG
from nltk.parse.generate import generate
demo_grammar = """
S -> NP VP
NP -> Det N
PP -> P NP
VP -> 'slept' | 'saw' NP | 'walked' PP
Det -> 'the' | 'a'
N -> 'man' | 'park' | 'dog'
P -> 'in' | 'with'
"""
def demo(N=23):
    print('Generating the first %d sentences for demo grammar:' % (N,))
    print(demo_grammar)
    grammar = CFG.fromstring(demo_grammar)
    for n, sent in enumerate(generate(grammar, n=N), 1):
        print('%3d. %s' % (n, ' '.join(sent)))
You can rewrite the grammar as suggested by alexis, which means keeping a separate list of terms (nouns, verbs, ...) for each position in the sentence.
But you can also apply a post-filtering strategy, which leaves the grammar untouched.
Here is a filter you can apply:
from collections import Counter

def f(sent):
    # True when the most frequent word occurs only once, i.e. no word repeats
    return Counter(sent.split(" ")).most_common(1)[0][1] <= 1

f("creation video software")     # returns True: good sentence
f("creation creation creation")  # returns False: bad sentence
f("creation software creation")  # returns False: bad sentence
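To plug a filter like this into the generation loop, one option is to filter the token lists lazily before joining them. The sketch below is only an illustration: `has_no_repeats` and `unique_sentences` are hypothetical helper names, and `itertools.product` stands in for the output of `generate(grammar)` so that the example runs without NLTK.

```python
from itertools import islice, product


def has_no_repeats(tokens):
    """Return True when every token in the sequence is distinct."""
    return len(set(tokens)) == len(tokens)


def unique_sentences(sentences, limit=None):
    """Lazily yield joined sentences whose tokens are all distinct.

    `sentences` is any iterable of token sequences; `limit` caps the
    number of sentences kept AFTER filtering (None means no cap).
    """
    kept = (" ".join(tokens) for tokens in sentences if has_no_repeats(tokens))
    return islice(kept, limit)


# Stand-in for generate(grammar): every possible "N N N" combination.
nouns = ["creation", "video", "software"]
candidates = product(nouns, repeat=3)

result = list(unique_sentences(candidates))
print(result)  # 6 sentences, the first being "creation video software"
```

With NLTK, the same idea would presumably look like `unique_sentences(generate(grammar), limit=23)`. Passing the limit to the filter rather than as `n=` to `generate` matters, because `n` caps the sentences produced before filtering, so you could end up with fewer than you asked for.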