python nltk context-free-grammar linguistics

NLTK: generate sentences without two occurrences of the same word in Python


I am using this NLTK code to generate sentences from demo_grammar (see below). The problem is that with grammar rules like N N or N N N, I end up with sentences like "creation creation creation". I am only interested in generating sentences in which no word occurs twice (e.g. "creation video software").

How could I do that?

The source of NLTK's generate.py is here: https://github.com/nltk/nltk/blob/develop/nltk/parse/generate.py

I have tried the demo code from generate.py:

from nltk.grammar import CFG
from nltk.parse.generate import generate

demo_grammar = """
  S -> NP VP
  NP -> Det N
  PP -> P NP
  VP -> 'slept' | 'saw' NP | 'walked' PP
  Det -> 'the' | 'a'
  N -> 'man' | 'park' | 'dog'
  P -> 'in' | 'with'
"""

def demo(N=23):

    print('Generating the first %d sentences for demo grammar:' % (N,))
    print(demo_grammar)
    grammar = CFG.fromstring(demo_grammar)
    for n, sent in enumerate(generate(grammar, n=N), 1):
        print('%3d. %s' % (n, ' '.join(sent)))

Solution

  • You can rewrite the grammar as suggested by alexis, which means using a separate list of terms (nouns, verbs, ...) for each position in a sentence.

    But you can also apply a post-filtering strategy, which leaves the grammar untouched:

    • generate all possible sentences with your grammar, including sentences in which a word occurs twice or more
    • apply a filter that removes every sentence in which a word occurs twice or more

    Here is a filter you can apply:

    from collections import Counter

    def f(sent):
        # True if no word occurs more than once in the sentence
        return max(Counter(sent.split(" ")).values()) == 1

    f("creation video software")     # returns True, good sentence
    f("creation creation creation")  # returns False, bad sentence
    f("creation software creation")  # returns False, bad sentence
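The two steps above (generate everything, then filter) can be sketched as a lazy generator that works on token lists rather than joined strings. This is only an illustration under stated assumptions: `without_repeats` is a hypothetical helper name, and the `nouns` list with `itertools.product` is a stand-in for what a rule like N N N would produce, so the snippet runs without NLTK.

```python
from itertools import islice, product


def without_repeats(sentences):
    """Lazily yield only the token sequences in which every word is distinct."""
    for tokens in sentences:
        if len(tokens) == len(set(tokens)):
            yield tokens


# Stand-in for the output of an `N N N` rule (assumed vocabulary): every
# 3-word combination, including repeats like ('creation', 'creation', 'creation').
nouns = ["creation", "video", "software"]
candidates = product(nouns, repeat=3)

# Filter first, then cap the count, so the limit applies to the kept sentences.
first_four = [" ".join(tokens) for tokens in islice(without_repeats(candidates), 4)]
print(first_four)
```

With NLTK, the same helper should accept the iterator returned by `generate(grammar)` from `nltk.parse.generate`, since it yields lists of word strings. Note that passing `n=N` to `generate` caps the number of sentences before any filtering happens, so it is probably better to leave `n` unset and limit the filtered stream with `islice` instead. For a recursive grammar, the `depth` argument of `generate` is likely needed to keep the enumeration finite.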