Tags: nltk, stanford-nlp, n-gram

How can a CFG and Google n-grams be combined to generate sentences?


I have a valid list of grammar rules and lexical items for generating phrases that are grammatically correct yet meaningless. I want to combine this with Google n-grams to generate only valid sentences. Is this feasible, and is there any paper on it? I am using NLTK and the Stanford CoreNLP tools.


Solution

  • No, it is not feasible. Real sentences have structural and semantic dependencies that go well beyond what n-grams can capture.

    I suppose you're thinking of generating a random structure by expanding your CFG, then using n-grams to select among the possible vocabulary choices. It's a pretty simple thing to code: chop off your grammar at the part-of-speech level, generate a "sentence" with your CFG as a string of POS tags, and use the n-grams to fill the tags out one by one.
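    A minimal sketch of that pipeline, using an inline toy tagged corpus in place of the Brown corpus or Google n-grams (the data and function names here are illustrative, not from any library):

    ```python
    import random
    from collections import defaultdict

    # Toy tagged corpus standing in for a real one (hypothetical data).
    tagged = [
        ("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"), ("the", "DET"),
        ("cat", "NOUN"), ("a", "DET"), ("cat", "NOUN"), ("saw", "VERB"),
        ("a", "DET"), ("dog", "NOUN"),
    ]

    # Bigram counts keyed by (previous word, POS tag of the next word).
    bigrams = defaultdict(lambda: defaultdict(int))
    for (w1, _), (w2, t2) in zip(tagged, tagged[1:]):
        bigrams[(w1, t2)][w2] += 1

    # Vocabulary per tag, as a fallback when no bigram is attested.
    vocab = defaultdict(list)
    for w, t in tagged:
        vocab[t].append(w)

    def fill(pos_sequence, seed=0):
        """Fill a POS-tag skeleton (a CFG expansion chopped off at the
        part-of-speech level) with words, preferring bigram continuations
        observed in the corpus."""
        rng = random.Random(seed)
        words = []
        for tag in pos_sequence:
            prev = words[-1] if words else None
            cands = bigrams.get((prev, tag))
            if cands:   # most frequent attested continuation
                words.append(max(cands, key=cands.get))
            else:       # back off to any word with the right tag
                words.append(rng.choice(vocab[tag]))
        return " ".join(words)

    # A CFG expansion S -> DET NOUN VERB DET NOUN, rendered as POS tags:
    print(fill(["DET", "NOUN", "VERB", "DET", "NOUN"]))
    ```

    The output is locally plausible word by word, but nothing constrains the sentence to mean anything, which is the limitation described above.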

    To work with Google's entire 5-gram collection you'll need a lot of disk space and either a huge amount of RAM or some clever programming, so I recommend you experiment with one of NLTK's tagged corpora (e.g., the Brown corpus with the "universal" tagset). Starting from any text, it is not hard to collect its n-grams, write a random text generator, and confirm that it produces semi-cohesive but undeniably incoherent (and still mostly ungrammatical) nonsense.
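    The random-text-generator experiment can be sketched in a few lines; this uses a tiny inline text as an assumption in place of a real NLTK corpus (with NLTK installed you could substitute, e.g., `nltk.corpus.brown.words()`):

    ```python
    import random
    from collections import defaultdict

    # Tiny sample text standing in for a real corpus (hypothetical data).
    text = ("the cat sat on the mat and the dog sat on the rug and "
            "the cat saw the dog and the dog saw the cat").split()

    # Bigram table: word -> list of observed next words (with repetition,
    # so frequent continuations are chosen more often).
    nexts = defaultdict(list)
    for w1, w2 in zip(text, text[1:]):
        nexts[w1].append(w2)

    def generate(start, length=10, seed=42):
        """Random walk over the bigram table: each step is locally
        attested, but the walk has no global structure or meaning."""
        rng = random.Random(seed)
        out = [start]
        for _ in range(length - 1):
            choices = nexts.get(out[-1])
            if not choices:     # dead end: no observed continuation
                break
            out.append(rng.choice(choices))
        return " ".join(out)

    print(generate("the"))
    ```

    Every adjacent word pair in the output occurs somewhere in the source text, yet the whole is exactly the semi-cohesive nonsense the answer predicts.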