python nltk context-free-grammar linguistics

NLTK Generate sentences without two occurrences of the same word in Python

I am using this NLTK code to generate sentences from demo_grammar (see below). The problem is that with grammar rules like N N or N N N, I end up with sentences like "creation creation creation". I am only interested in generating sentences in which no word occurs twice (e.g. "creation video software").

How could I do that?

The generate.py from NLTK is here: https://github.com/nltk/nltk/blob/develop/nltk/parse/generate.py

I have tried the demo code from generate.py:

from nltk.grammar import CFG
from nltk.parse.generate import generate  # import the function, not the nltk.parse.generate module

demo_grammar = """
  S -> NP VP
  NP -> Det N
  PP -> P NP
  VP -> 'slept' | 'saw' NP | 'walked' PP
  Det -> 'the' | 'a'
  N -> 'man' | 'park' | 'dog'
  P -> 'in' | 'with'
"""

def demo(N=23):

    print('Generating the first %d sentences for demo grammar:' % (N,))
    print(demo_grammar)
    grammar = CFG.fromstring(demo_grammar)
    for n, sent in enumerate(generate(grammar, n=N), 1):
        print('%3d. %s' % (n, ' '.join(sent)))

Solution

You can rewrite the grammar as suggested by alexis: this means using several lists of terms (nouns, verbs, ...), one for each specific position in a sentence.
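As a rough illustration of that idea, the sketch below builds grammar rules in which each noun position has its own word list. The category names (N1, N2, N3), the word lists, and the helper slot_grammar are assumptions made up for this example, not alexis's actual grammar. Because no word appears in two lists, a phrase built from one word per position cannot repeat a word.

```python
# Hypothetical position-specific word lists: no word appears in two lists,
# so a phrase that takes one word per position cannot repeat a word.
noun_slots = [
    ["creation", "design"],
    ["video", "audio"],
    ["software", "tool"],
]

def slot_grammar(slots):
    """Build CFG rule text with one nonterminal (N1, N2, ...) per position."""
    names = ["N%d" % i for i in range(1, len(slots) + 1)]
    lines = ["NP -> " + " ".join(names)]
    for name, words in zip(names, slots):
        lines.append(name + " -> " + " | ".join("'%s'" % w for w in words))
    return "\n".join(lines)

print(slot_grammar(noun_slots))
```

The resulting text can be passed to CFG.fromstring like demo_grammar. The trade-off is that each word is tied to a single position, so some orderings that a plain N N N rule would allow are no longer generated.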

But you can also apply a post-filtering strategy, which does not require touching the grammar:

1. Generate all possible sentences with your grammar, including sentences in which a word occurs twice or more.
2. Apply a filter that removes every sentence in which a word occurs twice or more.

Here is the filter you can apply:

from collections import Counter

def f(sent):
    # True when the most frequent word in the sentence appears exactly once
    return Counter(sent.split(" ")).most_common(1)[0][1] == 1

f("creation video software")     # returns True: good sentence
f("creation creation creation")  # returns False: bad sentence
f("creation software creation")  # returns False: bad sentence
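Putting the two steps together, here is a minimal sketch of the generate-then-filter pipeline. Since generate yields each sentence as a list of tokens, the check can run on the list before joining. The names has_no_repeats and unique_word_sentences are made up for this example, and itertools.product is only a stand-in for the sentences a rule like N N N would produce, so the snippet runs without NLTK.

```python
from itertools import product

def has_no_repeats(tokens):
    """Return True if every token in the list is distinct."""
    return len(set(tokens)) == len(tokens)

def unique_word_sentences(sentences):
    """Lazily yield joined sentences whose tokens are all distinct."""
    for tokens in sentences:
        if has_no_repeats(tokens):
            yield " ".join(tokens)

# Stand-in for generate(grammar): every expansion of a rule like N N N.
nouns = ["creation", "video", "software"]
candidates = (list(t) for t in product(nouns, repeat=3))

for sent in unique_word_sentences(candidates):
    print(sent)
```

With NLTK, the same generator should accept the output of generate directly, as in unique_word_sentences(generate(grammar)). Note that the n argument of generate caps the number of sentences before filtering, so if you need exactly N filtered sentences it is safer to limit the filtered stream instead, for example with itertools.islice.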